


Docket No. ARS-113
Serial No. 10/540,845 



Remarks 

Claims 57-82 are pending in the subject application. By this Amendment, Applicants have 
amended claims 62 and 73 to correct inadvertent errors. Entry and consideration of the amendment 
presented herein is respectfully requested. Accordingly, claims 57-82 are currently before the 
Examiner. Favorable consideration of the pending claims is respectfully requested. 

As an initial matter, replacement Figures 3 and 23 are submitted herewith. Replacement 
figures were necessary in order to remove any indication of color. No new matter has been added by 
these amendments. Entry of the replacement drawings is respectfully requested. 

The application is objected to on the grounds that the subject specification fails to comply 
with the requirements of 37 CFR 1.821(a)(1) and (a)(2). Specifically, the Examiner indicates that a
new sequence listing is required because the sequences in Figures 1-6, 13-15 and 22-24 are not 
identified by a sequence identifier number. The sequences shown in Tables III, IV and V of the 
subject specification have also been designated with a sequence identifier number. A Submission of 
Sequence Listing Under § 1.821, including a replacement sequence listing on paper and a computer
readable format, is attached. Accordingly, reconsideration and withdrawal of the objection is 
respectfully requested. 

The disclosure is objected to because it contained embedded hyperlinks or other forms of 
browser executable code. Applicants respectfully submit that this issue is moot in view of the 
amendments made to the specification. Accordingly, reconsideration and withdrawal of the 
objection is respectfully requested. 

Claims 57-82 are rejected under 35 U.S.C. § 101 on the grounds that the claimed invention 
lacks a substantial utility. In addition, claims 57-82 are rejected under 35 U.S.C. §112, first 
paragraph, as nonenabled on the grounds that the subject specification fails to teach a substantial 
utility for the claimed invention and, therefore, an ordinarily skilled artisan would not know how to 
use the claimed invention. The Office Action argues that the specification states that the invention is 
based upon the identification of an Open Reading Frame (ORF) in the human genome encoding a 
novel Preadipocyte factor-1-like polypeptide (referred to as SCS0009) and other splice variants. The
Office Action further argues that the asserted utilities of the claimed polypeptides have been based
upon domain organization and that the asserted functions of the polypeptides are merely hypothetical 
and based upon domain homology. The Office Action further argues that function cannot be
predicted based upon structural similarity to a protein in the sequence databases, citing to Skolnick et
al., Trends Biotechnol. (2000) 18:34-39. Applicants respectfully assert that the claimed invention
has substantial utility and, therefore, is enabled and traverse this rejection.

The examiner bears the initial burden of showing that a claimed invention lacks patentable 
utility. See In re Brana, 51 F.3d 1560, 1566, 34 USPQ2d 1436, 1441 (Fed. Cir. 1995) ("Only after 
the PTO provides evidence showing that one of ordinary skill in the art would reasonably doubt the 
asserted utility does the burden shift to the applicant to provide rebuttal evidence sufficient to
convince such a person of the invention's asserted utility."). The Patent Office must also articulate
the factual assumptions and provide evidentiary support relied upon in establishing the prima facie
showing. Applicants also note that compliance with the utility requirement of 35 U.S.C. § 101 and
the enablement requirement of 35 U.S.C. § 112, first paragraph, does not turn on whether an example
is disclosed (see M.P.E.P. § 2164.02). Indeed, the lack of working examples or methods that indicate
that the claimed polypeptide is involved in any of the asserted activities, standing alone, cannot be
the basis for a rejection under 35 U.S.C. 101 or 35 U.S.C. 112. Tol-O-Matic, Inc. v. Proma Produkt-Und
Mktg. Gesellschaft m.b.H., 945 F.2d 1546, 1553, 20 USPQ2d 1332, 1338 (Fed. Cir. 1991).

Applicants also note that the as-filed specification indicates (at page 5, lines 16-26): 

The novel polypeptide described herein was identified on the basis of a consensus
sequence for human Preadipocyte factor-1-like polypeptides in which the number
and the positioning of selected amino acids are defined for a protein sequence having
a length comparable to known Preadipocyte factor-1-like polypeptides.

The totality of amino acid sequences obtained by translating the known ORFs in the
human genome were challenged using this consensus sequence, and the positive hits
were further screened for the presence of predicted specific structural and functional
"signatures" that are distinctive of a polypeptide of this nature, and finally selected by
comparing sequence features with known Preadipocyte factor-1-like polypeptides.
Therefore, the novel polypeptides of the invention can be predicted to have
Preadipocyte factor-1-like activities.

Furthermore, the parameters used to identify the sequences and pertinent domains are provided in
Examples 1 and 4 and the Office Action fails to establish any reason one skilled in the art would 
doubt the asserted utilities of the claimed polypeptides or the functions assigned to the claimed 
polypeptides on the basis of the information disclosed in the as-filed application. 

Applicants now turn to the Patent Office's argument that the teachings of Skolnick et al. support
a finding that the claimed invention lacks patentable utility. With regard to the teachings of that
reference, Applicants submit that its teachings are not as absolute as argued in the Office Action.
Applicants submit, herewith, a Journal of Molecular Biology article (Wilson et al., 2000, J. Mol.
Biol., Vol. 297, pp. 233-249) published in the same time frame as Skolnick et al. As opposed to the
broad and sweeping generalizations found in Skolnick et al., Wilson et al. specifically analyzed and
addressed the degree to which structural annotation can be transferred between sequences at a given
level of sequence similarity (see Practical Implications and Figure 7, pages 245-247). As noted by
Wilson et al., if a sequence matches a protein database structure with an e-value of 0.001 and a
percent identity of 30%, the polypeptide is virtually certain to have the same fold, the polypeptide
has a 66% likelihood of having the same exact function, and the proteins have about a 99% chance
of having the same functional class. In this case, the claimed polypeptide has 34% identity, 46%
similarity and an e-value of 2e-47 (see attached alignment for SEQ ID NO: 2 and PREF-1 (SEQ ID
NO: 38)). Thus, Wilson et al. would indicate that it is more likely than not that the two polypeptides
would have the same or similar function, and it is respectfully submitted that those skilled in the art
would not have had a reason to doubt the asserted utility of the claimed invention. Thus, it is
respectfully submitted that the claimed invention has a specific, credible and substantial utility and
that one skilled in the art would know how to use the claimed invention in view of the teachings of
the specification. Accordingly, reconsideration and withdrawal of the rejections is respectfully
requested.

Claims 57, 64-69 and 77-82 are rejected under 35 U.S.C. § 112, first paragraph, as containing
subject matter which was not described in the specification in such a way as to reasonably convey to 
one skilled in the relevant art that the inventors, at the time the application was filed, had possession 
of the claimed invention. Applicants respectfully assert that there is adequate written description in 
the subject specification to convey to the ordinarily skilled artisan that they had possession of the
claimed invention. The Office Action argues that the as-filed specification fails to provide adequate 
written description and evidence of possession for the claimed genus of variants. The Office Action 
also notes that the only factor present in the claims is the recitation of an activity ("prevents the 
terminal differentiation of preadipocytes") and that there is no identification of any portion of the 
structure that must be conserved for the required activity. The Office Action further argues that it is 
unclear as to what molecules are within the genus of "active variants" as the specification does not
provide a complete or partial structure and fails to provide a representative number of species for the
recited genus. Applicants traverse the rejection.

At the outset, Applicants note that the claims are directed to an isolated polypeptide
comprising an active variant of SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 8, SEQ
ID NO: 9 or SEQ ID NO: 10, wherein any amino acid specified in the sequence is
non-conservatively substituted, provided that no more than 15% of the amino acid residues are
substituted and said active variant prevents the terminal differentiation of preadipocytes. Thus, and
contrary to the assertion in the Office Action, the claims provide at least a partial structure of the 
claimed polypeptide variants; namely, any one of SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, 
SEQ ID NO: 8, SEQ ID NO: 9 or SEQ ID NO: 10 in which no more than 15% of the amino acid 
residues are substituted and wherein the polypeptide retains the ability to prevent the terminal 
differentiation of preadipocytes. 

Turning to the argument that the as-filed specification fails to identify the portion of the
molecule that is associated with the ability to prevent the terminal differentiation of preadipocytes,
Applicants respectfully submit that the domain of PREF-1 associated with that activity was known to
those skilled in the art (see, for example, Smas et al., Mol. Cell. Biol., 1997, Vol. 17, pp. 977-988, a
copy of which is attached). As discussed in that publication, the ability to prevent terminal
differentiation of preadipocytes is associated with the N-terminal region of the polypeptide and
soluble PREF-1 containing that domain inhibits preadipocyte differentiation (see pages 981-983).
As the Patent Office is aware, the Federal Circuit has made clear that the specification need not
describe every permutation of an invention nor subject matter known to those of skill in the art.
Capon v. Eshhar, 418 F.3d 1349, 1359-60 (Fed. Cir. 2005). Moreover, an adequate written
description of an invention that involves biological macromolecules need not contain a recitation of 
each known structure, particularly when those structures are already known in the art. Falkner v.
Inglis, 448 F.3d 1357, 1366 (Fed. Cir. 2006) ("the forced recitation of known sequences in patent
disclosures would only add unnecessary bulk to the specification. Accordingly, we hold that where
. . . accessible literature sources clearly provided, as of the relevant date, [the sequences], satisfaction
of the written description requirement does not require either the recitation or incorporation by 
reference"). Accordingly, it is respectfully submitted that the as-filed specification and currently 
pending claims fully comply with the written description requirement and reconsideration and 
withdrawal of the rejection under 35 U.S.C. § 112, first paragraph, is respectfully requested.

It should be understood that the amendments presented herein have been made solely to 
expedite prosecution of the subject application to completion and should not be construed as an 
indication of Applicants' agreement with or acquiescence in the Examiner's position. Applicants 
expressly reserve the right to pursue the invention(s) disclosed in the subject application, including 
any subject matter canceled or not pursued during prosecution of the subject application, in a related
application. 

In view of the foregoing remarks and amendments to the claims, Applicants believe that the
currently pending claims are in condition for allowance, and such action is respectfully requested. 

The Commissioner is hereby authorized to charge any fees under 37 CFR §§ 1.16 or 1.17 as
required by this paper to Deposit Account No. 19-0065. 



Applicants invite the Examiner to call the undersigned if clarification is needed on any of this 
response, or if the Examiner believes a telephonic interview would expedite the prosecution of the 
subject application to completion. 



Respectfully submitted, 

Frank C. Eisenschei 
Patent Attorney 
Registration No. 45,332 
Phone No.: 352-375-8100 
Fax No.: 352-372-5800 
Address: P.O. Box 142950 

Gainesville, FL 32614-2950 

FCE/jb/sl 

Attachments: Annotated and Replacement Figures 3 and 23 
Submission of Sequence Listing and Statement 
New pages 1-37 (Sequence Listing) 
Wilson et al., 2000, J. Mol. Biol., Vol. 297, pp. 233-249
Alignment for SEQ ID NO: 2 and PREF-1 (SEQ ID NO: 38)
Smas et al., Mol. Cell. Biol., 1997, Vol. 17, pp. 977-988







doi:10.1006/jmbi.2000.3550, available online at http://www.idealibrary.com. J. Mol. Biol. (2000) 297, 233-249






Assessing Annotation Transfer for Genomics: 
Quantifying the Relations between Protein Sequence,
Structure and Function through Traditional and 
Probabilistic Scores 

Cyrus A. Wilson, Julia Kreychman and Mark Gerstein*

Department of Molecular Biophysics and Biochemistry
Department of Computer Science
Yale University, 266 Whitney Avenue, PO Box 208114, New Haven, CT 06520, USA



*Corresponding author 



Measuring in a quantitative, statistical sense the degree to which structural
and functional information can be "transferred" between pairs of
related protein sequences at various levels of similarity is an essential
prerequisite for robust genome annotation. To this end, we performed
pairwise sequence, structure and function comparisons on ~30,000 pairs
of protein domains with known structure and function. Our domain
pairs, which are constructed according to the SCOP fold classification,
range in similarity from just sharing a fold, to being nearly identical. Our
results show that traditional scores for sequence and structure similarity
have the same basic exponential relationship as observed previously,
with structural divergence, measured in RMS, being exponentially related
to sequence divergence, measured in percent identity. However, as the
scale of our survey is much larger than any previous investigations, our
results have greater statistical weight and precision. We have been able
to express the relationship of sequence and structure similarity using
more "modern scores," such as Smith-Waterman alignment scores and
probabilistic P-values for both sequence and structure comparison. These
modern scores address some of the problems with traditional scores,
such as determining a conserved core and correcting for length dependency;
they enable us to phrase the sequence-structure relationship in
more precise and accurate terms. We found that the basic exponential
sequence-structure relationship is very general: the same essential
relationship is found in the different secondary-structure classes and is
evident in all the scoring schemes. To relate function to sequence and
structure we assigned various levels of functional similarity to the
domain pairs, based on a simple functional classification scheme. This
scheme was constructed by combining and augmenting annotations in
the enzyme and fly functional classifications and comparing subsets of
these to the Escherichia coli and yeast classifications. We found sigmoidal
relationships between similarity in function and sequence, with clear
thresholds for different levels of functional conservation. For pairs of
domains that share the same fold, precise function appears to be conserved
down to ~40% sequence identity, whereas broad functional class
is conserved to ~25%. Interestingly, percent identity is more effective at
quantifying functional conservation than the more modern scores (e.g.
P-values). Results of all the pairwise comparisons and our combined
functional classification scheme for protein structures can be accessed from a
web database at http://bioinfo.mbb.yale.edu/align.

© 2000 Academic Press 

Keywords: bioinformatics; sequence similarity; percent identity; structure 
similarity; functional classification 



Abbreviations used: EC, Enzyme Commission; EST, expressed sequence tags; SCOP, structural classification of 
proteins; GO, Gene Ontology Project.
E-mail address of the corresponding author: Mark.Gerstein@yale.edu 



0022-2836/00/010233-17 $35.00/0 






Introduction 

The problem of genome annotation 

Perhaps the most valuable information to be
gained from a genome analysis is functional annotation
of all the gene products. Unfortunately, of
all the proteins whose sequences are known, functions
have been experimentally determined for
only a very small number (Andrade & Sander,
1997). Given the current size and accessibility of
sequence and structure data, homologs of a newly
sequenced gene's product can be identified via
database searches, and probable structure and
function assigned to the gene product (Bork et al.,
1998). This is based on the concept that sequence
similarity implies structural and functional similarity.
However, structural and functional annotations
should be transferred with caution. If a
protein is assigned an incorrect function in a database,
the error could carry over to other proteins
for which structure or function is inferred by homology
to the errant protein (Brenner, 1999; Karp,
1996, 1998a). In large databases such an error can
propagate out of control, presenting a serious quality
control issue as we move to larger genomes
from multicellular organisms.

Benchmarking fold and function recognition 

Here, we used manually curated structural and
functional classifications as standards in analyzing
to what degree annotations of a protein's structure
and function can be transferred to a similar
sequence. The knowledge gained from the study
can be used to establish confidence levels for structure
and function prediction, improving our understanding
of how long it will take to annotate
accurately an entire genome.

Our simultaneous analysis of relationships 
between sequence and structure, sequence and 
function, and structure and function (Figure 1) 
may provide insight into paradigms for functional 
prediction other than that based alone on sequence 
similarity (Enright et al., 1999).

Past results 

Sequence-structure 

The transfer of structural annotation is well
characterized. Chothia & Lesk (1986, 1987) found
that structural divergence, when expressed in
terms of the RMS separation of matching alpha
carbon atoms, was an exponential function of
sequence divergence, expressed in terms of the
fraction of residues that differed between
sequences. The reliability of structural annotation
transferred by homology, then, depends on the
sequence identity of the homologous proteins
(Chothia & Lesk, 1986). Flores et al. (1993), Russell
& Barton (1994), and Russell et al. (1997) observed
the same general trend, and also characterized the
conservation of structural features other than the
Cα backbone, such as secondary structure, accessibility
and torsion angles. A paper by Wood &
Pearson (1999) re-expressed the sequence-structure
relationship in terms of statistically based "Z-scores"
and found that this relationship had a
simple linear form in terms of these scores. They
also noted that protein families differed in detail in
the slope of this linear relationship.

Others have focused on the limits of sequence
comparison, specifically around the "twilight
zone," the region of sequence similarity that does
not reliably imply structural homology (Doolittle,
1987), and on establishing cut-offs for significant
sequence similarity. Using the SCOP structural
classification (Murzin et al., 1995), Brenner et al.
(1998) benchmarked the effectiveness of the popular
FASTA and BLASTP programs and their probabilistic
scoring schemes (i.e. the e-value) (Pearson
& Lipman, 1988; Pearson, 1996; Altschul et al.,
1990, 1994; Karlin & Altschul, 1993). They found
that in making fold assignments, the FASTA
e-value closely tracked the number of false positives,
i.e. the error rate, and that at a conservative
e-value cut-off of 0.001, the FASTA program could
detect nearly all the relationships that would be
detected by a full Smith-Waterman comparison
(Smith & Waterman, 1981). Specifically, they found
that FASTA with a 0.001 threshold would find
16% more of the structural relationships in SCOP
than would be found by standard sequence comparison
with a 40% identity threshold. This rigorous
benchmarking approach has been extended to
assess transitive sequence comparison, through a
third intermediate sequence and multiple-sequence
matching programs such as PSI-blast (Park et al.,
1997, 1998; Gerstein, 1998a; Salamov et al., 1999). In
a related study Rost (1999) worked on characterizing
the region after the twilight zone, which he
called the "midnight zone". In a sense these benchmarking
studies have culminated in the CASP fold
recognition experiments (Moult et al., 1997;
Sternberg et al., 1999).



Sequence-function

Although the exact dependence of functional
similarity on sequence and structural similarity is
not completely clear, initial indications of a gene
product's function are most often based on simple
sequence similarity (Bork et al., 1994, 1998). Often
these are merely based on the best hit in database
comparisons; see, for example, the annotation of
some of the early genomes (Fraser et al., 1995,
1998). However, possibilities for more robust annotation
transfer are increasingly available. One looks
at the pattern of hits amongst different phylogenetic
groups (Tatusov et al., 1997). Often these
focus on the existence of key motifs and patterns
associated with function (Zhang et al., 1998; Bork &
Koonin, 1996; Attwood et al., 1999).







Figure 1. This Figure schematically depicts certain aspects of our comparison methodology. (a) The paradigm relating
sequence to structure to function. There has not been as much assessment of functional annotation transfer based
on structure as there has been with sequence-based structural and functional annotation transfer. (b) How we conceptualized
our analysis in terms of pairs. A few examples of SCOP domains (identified on the left and bottom) are
included from our comparison. In the Figure the shape represents fold, and the pattern represents function. We have
highlighted some example categories of pairs: a pair that shares fold and function, a pair that shares fold but not
function and a pair that shares neither fold nor function. The latter category of pairs is not considered in our investigation;
we looked only at paired domains with the same fold. In constructing our pairs, we used only a representative
set of SCOP domains. This is illustrated in the Figure by the domains flagged with asterisks. Note, in particular,
that the SCOP domain d4tima_ is not paired with anything because it is represented by d5tima_, which is the same
species and protein. For each level of pairs (fold, superfamily, family), cluster representatives were chosen for the
level below: (i) for family pairs, one representative was selected from each species/protein, the level below, and then
paired with all the other representatives within its family; (ii) for superfamily pairs, one representative was chosen
from each family, unless there were domains in the family that shared less than 40% sequence identity, in which case
additional representatives were included, each not more than 40% identical with the other representatives from the
family (this occurs, for instance, for the globins); and (iii) likewise for fold pairs, one representative was chosen from
each superfamily, more if there were domains with less than 40% sequence identity. (c) Subdivides the pairs into the
four SCOP classes from which they were composed: (i) all-α, domains consisting of α-helices; (ii) all-β, domains
consisting of β-sheets; (iii) α/β, domains with integrated α-helices and β-strands; and (iv) α + β, domains with segregated
α-helices and β-strands. We initially set apart the immunoglobulins from the rest of the all-β pairs because we realized
that their large number biases our data. However, we compared the results for the immunoglobulin pairs to all
other pairs and found that they generally exhibit the same behavior as the other pairs. Therefore we decided to leave
them in the comparison.



Sequence-structure-function 

One way that the better-defined sequence-structure
relationship can assist in function prediction is
initially to predict the structure of an uncharacterized
sequence and then predict the function based
on the limited repertoire of functions known to
occur with that structure. To some degree this was
achieved by Fetrow and co-workers (Fetrow et al.,
1998; Fetrow & Skolnick, 1998). They predicted
structural profiles based on threading and ab initio
methods, and then searched with these against
profiles of known structures in order to predict
function.

In related work, Russell et al. (1998) discussed
using identification of structural binding sites in
predicting protein function. In a comprehensive
study, Hegyi & Gerstein (1999) investigated to
what degree folds were associated with functions.
They found that most folds were associated with
one or two functions with the exception of a few
special folds, such as the TIM barrel, that could
carry out numerous functions. Furthermore, they
found that particular folds were often confined to
distinct phylogenetic groups, an additional fact
that can feed into an integrated sequence-structure-function
analysis (Gerstein & Hegyi, 1998;
Gerstein, 1997, 1998b,c).

Here, we look at pairwise comparisons of
protein sequence, structure and function among
proteins that share the same fold. We assess the
trends relating sequence, structure and function
and consider the implications for structural and
functional annotation transfer.

New developments: probabilistic scoring and
growth of the databank

The past studies regarding sequence, structure
and function relationships often used RMS separation
and percent sequence identity (or a linear
variant of it, such as the fraction of mutated residues)
to express similarities in structure and in
sequence, respectively. However, it has become
increasingly common to use probabilistic scoring
schemes (P-values) to express the quality of a
match in terms of statistical significance rather
than an arbitrary raw score such as percent identity
(Pearson, 1998; Karlin & Altschul, 1990, 1993;
Karlin et al., 1991; Altschul et al., 1994; Bryant &
Altschul, 1995; Abagyan & Batalyov, 1997). With
P-values, scores from different investigations can
be compared in a common framework. Recently, it
was found that sequence and structure similarity
significance can be expressed as P-values in the
same unified statistical framework (Levitt &
Gerstein, 1998). Here, we use such probabilistic
scoring methods to overcome the limitations of the
more traditional scores.

Another recent development is the tremendous
growth in the number of solved structures. The
RCSB Protein Data Bank (Bernstein et al., 1977) now
contains more than 10,000 protein structures. These
structures are broken into more than 18,000
domains, and then domains that share a fold are
paired up with each other for comparison
(Figure 1(b)). Here, we survey ~30,000 pairs of
protein domains that are known to have the same
fold, approximately 1000 times the number compared
by Chothia & Lesk (1986). The large scale of
this comparison affords greater statistical weight to
the results.

Alignment of 30,000 pairs from SCOP 

The basic unit of comparison: a pair of
protein domains

The protein domains that we studied were classified
by SCOP, a Structural Classification of Proteins
(Murzin et al., 1995; Brenner et al., 1996;
Hubbard et al., 1997), a hierarchy of five levels:
(i) class, domains that have the same secondary
structural content (all-α, all-β, α/β, or α + β);
(ii) fold, domains that geometrically share the same
tertiary fold; (iii) superfamily, domains descended
from the same ancestor (but which lack measurable
sequence similarity); (iv) family, domains in the
same protein sequence family (which have appreciable
sequence similarity); and (v) species and
protein.

Pairs of protein domains that are grouped
together at the fold, superfamily or family level
form the basic unit of our comparisons.



Selection of pairs 

There is potentially a huge number of pairs of
domains that can be constructed out of the
relationships in SCOP. For instance, in the current
version of SCOP there are ~3.9 million potential
pairs between domains sharing the same fold.
Most of these are between nearly identical structures.
In order to keep the number of pairs manageable,
we used a straightforward clustering
scheme, described in the legend to Figure 1. We
selected 29,454 representative pairs from the total
in SCOP. To achieve a wide range of similarities,
we constructed the pairs on three levels of the
SCOP hierarchy: (i) family pairs, 19,542 pairs of
domains in the same family; (ii) superfamily pairs,
4220 pairs of domains in the same superfamily
but different families; and (iii) fold pairs, 5692
pairs of domains in the same fold but different
superfamilies.
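
As a quick consistency check on the figures quoted above, the three levels of pairs can be summed and compared with the approximate number of potential same-fold pairs. The following Python sketch uses only the counts stated in the text; the variable names are illustrative, and because the ~3.9 million figure is approximate, the resulting fraction is only indicative.

```python
# Pair counts per SCOP level, as quoted in the text.
pairs_by_level = {
    "family": 19_542,       # same family
    "superfamily": 4_220,   # same superfamily, different families
    "fold": 5_692,          # same fold, different superfamilies
}

# Approximate number of potential same-fold pairs ("~3.9 million" in the text).
potential_pairs = 3_900_000

# Total number of representative pairs actually selected.
total_selected = sum(pairs_by_level.values())

# Rough fraction of the potential pairs that the representative set covers.
selected_fraction = total_selected / potential_pairs

if __name__ == "__main__":
    print(f"selected pairs: {total_selected}")
    print(f"fraction of potential pairs: {selected_fraction:.4%}")
    for level, count in pairs_by_level.items():
        print(f"{level:>12}: {count / total_selected:.1%} of selected pairs")
```

The sum of the three levels reproduces the 29,454 representative pairs reported, which is well under 1% of the potential same-fold pairs.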

All the selected domains were at least 50 residues
in length and were drawn from the four
major SCOP secondary-structural classes: all-α,
all-β, α/β, and α + β (Figure 1(c)).

We automatically aligned each of our selected 
domain pairs twice, once by global Needleman- 
Wunsch sequence comparison (Needleman & 
Wunsch, 1971; Myers & Miller, 1998) and then 
by structure (Gerstein & Levitt, 1996, 1998), cal- 
culating scores for sequence and structural simi- 
larity. 
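
For readers unfamiliar with global alignment, a minimal Python sketch of a Needleman-Wunsch comparison followed by a percent identity calculation is given below. It is not the implementation used in this study: the unit match, mismatch and linear gap scores are assumptions chosen for simplicity (a real comparison would typically use a substitution matrix and affine gap penalties), and percent identity is assumed here to be computed over aligned residue pairs only, ignoring gapped columns. The function names are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues among aligned (non-gap) columns."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    identical = sum(1 for x, y in pairs if x == y)
    return 100.0 * identical / len(pairs)


if __name__ == "__main__":
    s, x, y = needleman_wunsch("GATTACA", "GACTACA")
    print(s, x, y, round(percent_identity(x, y), 1))
```

Because gapped columns are excluded from the denominator in this sketch, two sequences that differ only by an insertion still score 100% identity, which illustrates one limitation of percent identity as a similarity measure.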

Web-accessible database 

The results of all the pairwise comparisons are 
available via a searchable database on the web at 
http://bioinfo.mbb.yale.edu/align. The query
engine allows searches of individual SCOP pairs, 
all pairs that include a given SCOP domain, or all 
pairs containing any SCOP domain contained in a 
given PDB entry. 

Traditional scores: RMS and percent identity 

The sequence-structure relation, as expressed by 
the root-mean-square (RMS) of the aligned C" dis- 
tances and percent sequence identity, has been pre- 
viously characterized as an exponential function by 
Chothia & Lesk (1986) and others (Flores et al 
1993; Russell & Barton, 1994; Russell et al. 1997). 
As Figure 2 illustrates, our data display a similar 
trend. (Exact equations are given in the legend to 
Figure 2.) However, we have one thousand times 
as many data points as in Chothia and Lesk's orig- 
inal study (30,000 as opposed to 30). 
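
An exponential trend of this kind can be estimated from binned median values by a straight-line fit to the logarithm of the RMS. The sketch below is only an illustration of that idea: the function name `fit_exponential`, the log-linear least-squares approach, and the synthetic data points are assumptions, not the fitting procedure or the data used in this study.

```python
import math

def fit_exponential(h_values, r_values):
    """Fit R = A * exp(k * H) by least squares on ln R = ln A + k * H.

    Returns the pair (A, k). All RMS values must be positive.
    """
    n = len(h_values)
    logs = [math.log(r) for r in r_values]
    mean_h = sum(h_values) / n
    mean_log = sum(logs) / n
    sxx = sum((h - mean_h) ** 2 for h in h_values)
    sxy = sum((h - mean_h) * (lg - mean_log) for h, lg in zip(h_values, logs))
    k = sxy / sxx
    return math.exp(mean_log - k * mean_h), k

# Synthetic medians generated from a known curve (illustrative values only).
hamming = [0, 20, 40, 60, 80]                       # percent sequence difference H
medians = [0.5 * math.exp(0.02 * h) for h in hamming]
a_fit, k_fit = fit_exponential(hamming, medians)
print(f"A = {a_fit:.3f}, k = {k_fit:.4f}")
```

Fitting in log space weights relative rather than absolute deviations, which is one reasonable choice when the RMS spans an order of magnitude; a direct non-linear fit would weight the large-RMS bins more heavily.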

The main difference between our results and
the previous studies is due to differences in 
RMS "trimming" methods. By trimming we refer 
to the process of removing the worst-fitting 
aligned atoms from the RMS calculation, to 
arrive at a structural "core." This was first 
developed in Lesk's sieve-fit procedure (Lesk & 
Chothia, 1984) and has been refined in numer-
ous studies (e.g. Gerstein & Altman (1995)). This
is done because the small distances between 
well-matched alpha carbon atoms have much 
less of an effect on the RMS than do the very 
large distances between poorly matched atoms. 
The untrimmed score of divergent protein 
domains is then concerned primarily with the 
poorly matched residues instead of the con- 
served core. Trimming alleviates this effect by 
restricting the RMS calculation to include only 
those residues believed to be in the conserved 
core. However, the degree of trimming is to
some extent arbitrary, and this choice affects the
baseline of the reported RMS scores. Here we 
considered only the better half (50%) of matched 
residues in a given pair of protein domains. 
Chothia & Lesk (1986) chose a somewhat differ- 
ent threshold. Figure 2(c) and (d) demonstrate 
the effect of trimming. 
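
As a rough illustration of the trimming described above, the following Python sketch computes an RMS over all aligned Cα distances and over only the better-matching half. The function name `trimmed_rms`, the `keep_fraction` parameter, and the example distances are hypothetical; the distances are assumed to have been measured already, after least-squares superposition, and no re-fitting on the trimmed core is attempted here.

```python
import math

def trimmed_rms(distances, keep_fraction=0.5):
    """RMS over the best-fitting fraction of aligned C-alpha distances (in Å)."""
    if not distances:
        raise ValueError("need at least one aligned distance")
    ranked = sorted(distances)                       # smallest distances first
    n_keep = max(1, int(len(ranked) * keep_fraction))
    core = ranked[:n_keep]                           # the assumed conserved core
    return math.sqrt(sum(d * d for d in core) / len(core))

# Hypothetical distances: a well-matched core plus a few poorly matched atoms.
example = [0.4, 0.6, 0.5, 0.9, 1.2, 3.5, 6.0, 8.0]
untrimmed = trimmed_rms(example, keep_fraction=1.0)
trimmed = trimmed_rms(example, keep_fraction=0.5)
print(f"untrimmed RMS = {untrimmed:.2f} Å, 50% trimmed RMS = {trimmed:.2f} Å")
```

Because the distances are squared, the three poorly matched atoms dominate the untrimmed value, which is the effect trimming is meant to remove.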



Analogous alignment similarity scores: Smith- 
Waterman score and structural 
comparison score 

The dependence of the RMS separation on trim- 
ming method restricts its usefulness in comparing 
data. Likewise, there are many problems with 
using percent identity as a measure of sequence 
similarity. For instance, a match of non-identical 
but still similar residues (e.g. Arg versus Lys) scores 
the same as one between completely different resi- 
dues (e.g. Arg versus Val), and gaps do not enter in 
the score calculation. Consequently, we now turn
to alignment similarity scores, which eliminate 
some of the problems with traditional scores. 

For sequence alignments, an alignment score is
defined as the sum of the similarity matrix values
for the alignment, minus the total gap penalty.
This is sometimes called the Smith-Waterman score
(Smith & Waterman, 1981). An analogous align-
ment score for structure is the structural compari-
son score, described by Levitt & Gerstein (1998).
We will refer to these two similarity scores as Sseq
and Sstr, respectively. Note that they both increase
for more similar pairs, whereas RMS increases for
more divergent pairs. Specifically, Sstr is the score
maximized by the structural alignment program
we used (Gerstein & Levitt, 1998). It can be calcu-
lated from any pair of aligned structures according
to the function:



$$S_{str} = M \left( \sum_{i} \frac{1}{1 + (d_i/d_0)^2} - \frac{N_{gap}}{2} \right) \qquad (1)$$



M and d_0 are constants, usually set to 10 and 5 Å,
N_gap is the number of gaps in the alignment, d_i is
the distance between each aligned pair of Cα
atoms, and the sum is carried over all aligned
pairs, i.
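
A minimal Python sketch of this score follows. It assumes the form Sstr = M (sum over i of 1/(1 + (d_i/d_0)^2) - N_gap/2), which is our reading of the definition above following Levitt & Gerstein (1998), with M = 10 and d_0 = 5 Å as defaults. The function name `structural_score` and the example distances are hypothetical.

```python
def structural_score(distances, n_gaps, m=10.0, d0=5.0):
    """Structural comparison score for one alignment.

    distances: C-alpha to C-alpha distances (Å) for each aligned pair
    n_gaps:    number of gaps in the alignment
    Assumed form: M * (sum_i 1 / (1 + (d_i / d0)**2) - n_gaps / 2)
    """
    match_term = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return m * (match_term - n_gaps / 2.0)

# A perfect match contributes M; a pair 20 Å apart contributes almost nothing.
print(structural_score([0.0], 0))
print(structural_score([0.0, 20.0], 0))
print(structural_score([0.0, 5.0, 10.0], 1))
```

Each aligned pair contributes at most M to the total, so a few badly matched pairs cannot dominate the score in the way they dominate an untrimmed RMS.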



The main advantage of Sstr over RMS in describ-
ing structural similarity is that the Cα to Cα
distance, d_i, appears in the denominator of the cal-
culation. This means that the smallest distances,
corresponding to the best matches in the conserved
core, are most significant in determining the score.
Hence, the need for trimming is eliminated. Sstr is
also advantageous because it takes gaps into
account and because of the fundamental analogy
between this score and Sseq.

Figure 3(a) displays the relationship between
structural and sequence similarity as expressed by
Sstr and Sseq. Figure 3(c) and (d) show calibration
curves relating each of these scores back to
approximate RMS separation and percent identity,
respectively. Calibration curves help one get an
intuitive feel for the degree of relationship in terms
of the more traditional scores. Figure 3(b) adds a
third axis, alignment length, and demonstrates that
Sstr depends greatly on this quantity. Although Sstr
and Sseq are "better" scores than RMS and percent
sequence identity, the heavy dependence of both of
these on length limits their usefulness in many
situations. In other words, two pairs of similar
domains with equal percent sequence identities but
different lengths can have drastically different
alignment similarity scores.



Probabilistic scores: P-values expressing the 
significance of sequence and 
structure similarity 

Probabilistic scores can, to a great degree, over- 
come the length-dependence problems associated 
with the alignment scores. Probabilistic measures 
are advantageous because they express similarity 
not by an arbitrary "score" but by a statistical sig- 
nificance: the likelihood that such a similarity 
could be achieved by chance. This likelihood is 
also called the "P-value." We used calculations
(described in detail in the legend to Figure 4)
based on those given by Levitt & Gerstein (1998) to
obtain P-values based directly on Sstr and Sseq; we
refer to these calculated P-values as Pstr and Pseq,
respectively. For Pseq we could equally well have
used the numbers from one of the popular
sequence search programs (i.e. BLAST or FASTA)
as all these values have been shown to be perfectly
proportional to each other (Levitt & Gerstein, 1998;
Brenner et al., 1998).

Pseq and Pstr can be used to express the relation-
ship between structure and sequence similarity on
a more fundamental level. Figure 4(a) shows a log-
log (base 10) plot of Pstr against Pseq. Because it is
log-log, trends can be visualized as straight lines. 
Two straight lines are necessary to fit the points 
well, with the discontinuous boundary between 
the lines located at the beginning of the twilight 
zone. The different slope of the line at low 
sequence similarity reveals that in the twilight 
zone there is a different relationship between the 
significance of structural similarity and that of 
sequence similarity. In particular, for domain pairs 







Figure 2. RMS as a function of percent identity. (a) A simple scatter plot of our pairs, relating RMS separation to
percent sequence identity. This is similar to the presentation given by Chothia & Lesk (1986), but in this survey we
looked at 30,000 pairs, 1000 times the number they compared. Outliers (pairs with RMS scores further than two stan-
dard deviations from the mean for their percent identity) are excluded from this graph; they represent domains that
are very closely related with the exception of a conformational change. (b) A simplified graph with a number of fits
to the data. For each percent identity bin we show the median RMS value, indicated by (♦) and the top and bottom
quartile RMS values, indicated by the bars. Two fits are drawn through the median RMS values. The thin line,
labeled SINGLE, is a simple exponential fit through the medians. It has the form:

R = 0.21exp(0.0132H)

where R is the RMS deviation after least-squares fitting, H is the percent difference between the sequences (H for
Hamming distance), and H = 100% - I, where I is the percent sequence identity. The thick line, labeled MULTI, is a
multigraph fit, which is described in the legend to Figure 4. The relation between RMS and percent identity according
to this fit is expressed by the equation:

The twilight zone of sequence identity and below is labeled TZ. In this region, sequence similarity is not significant
and not reliable for predicting structural similarity. This is why the median values in this area of the graph deviate
significantly from the fits, which consider only data above 20% sequence identity. For reference we include the orig-
inal data points from Chothia and Lesk's 1986 paper (A.M. Lesk, personal communication), indicated by X. Their
data follow the form:

R = 0.40exp(0.0187H)

The difference between the Chothia & Lesk trend and our relationship is due to the different trimming methods used
in calculating the RMS score. Chothia and Lesk imposed a 3 Å cut-off in determining the conserved core residues; we
defined the core as the better matching (in terms of Cα distances) half (50%) of the residue pairs. (c) and (d) The
effect our trimming has on median RMS values. The RMS values in (c) are calculated from all the matched residues
in each pair; the values in (d) are calculated from the better matching 50% of the residues.



in the twilight zone (according to the percent iden-
tity calibration in Figure 4(b)), structural
similarity is more significant than sequence simi-
larity (having a smaller P-value or more negative
log P-value). In contrast, for pairs with more than
~30% identity, the situation is reversed, with a
given pair having more significant sequence simi- 
larity than structural similarity. One possible 
interpretation of this reversal is as follows. Struc- 
ture is always more highly conserved than 
sequence, so usually a given amount of structural 
similarity is not as significant as a corresponding 
amount of sequence similarity. However, this is
true only when meaningful sequence similarity
actually exists; thus, it does not apply in the twi-
light zone, where sequence similarity is by defi-
nition not significant. Note that all pairs in our
comparison share at least the same fold, implying
that they always have a significant amount of
structural similarity.

In other words, for closely related sequences, 
differences in sequence similarity are more mean-
ingful, whereas for highly diverged sequences that
share the same fold, the differences in structural 
similarity are more significant. 

Fitting two lines to the Pstr versus Pseq graph
suggests that the same might be done for other 
scoring schemes. It is possible to some degree to fit 







[Figure 3 plot; x-axis: Smith-Waterman score (Sseq)]



Figure 3. Similarity scores: structural comparison score as a function of Smith-Waterman score. Alignment simi-
larity scores Sstr and Sseq have certain advantages over RMS and percent identity scores for expressing the sequence-
structure relation. Sstr is calculated according to equation (1) in the text (Gerstein & Levitt, 1998; Levitt & Gerstein,
1998). Sseq is calculated using the BLOSUM50 matrix (Henikoff & Henikoff, 1992) with gap opening and extension
penalties of -12 and -2, respectively. (a) This is analogous to (b) in Figure 2. From the original 30,000 pairs we show
the median Sstr value for each Sseq bin, along with quartile bars above and below. Again the twilight zone and below
is labeled TZ. The thin line, marked SINGLE, is a simple fit to the median Sstr values in this graph; it has the form:

Sstr = 2144 - 1106exp(-0.00544Sseq)

The thick fit, marked MULTI, is the multigraph fit, explained below. It follows the equation:

Sstr = 2157 - 787exp(-0.0028Sseq)

The equations presented here provide an approximation of the observed trends; as (b) illustrates, they are nothing
more than simple approximations. The main disadvantage of Sstr as a measure of structural similarity is its heavy
length dependency for pairs of structurally similar protein domains. (b) Surface plot of the median Sstr as a function
of Sseq and alignment length (the number of matched residue pairs). It is clear that the size of the aligned domains
plays a major role in the resulting Sstr, even though our fits do not take length into account. (c) and (d) Relate Sseq
and Sstr to the more familiar percent identity and RMS measures. The fits were used to convert between scoring
schemes in constructing the multigraph fit. We derived the multigraph fit in order to create one set of equations and
parameters that would relate sequence and structural similarity using either the percent identity and RMS scheme or
the Sseq and Sstr scheme, and allow translation between them. We simultaneously performed least-squares fits to the
median values in four graphs: Figures 2(b) and 3(a) and the calibrations of Sseq to percent identity and Sstr to RMS,
(c) and (d), respectively. In all cases, we ignored data in and below the sequence identity twilight zone (labeled TZ).
The parameters in (a) are dependent on the parameters in Figure 2(b) via the mentioned calibrations.



the traditional RMS versus percent identity graph 
(Figure 2) with two straight lines instead of an 
exponential curve. However, in this case, we opted
for the more conventional presentation. 



Class differences 

The division of SCOP into classes based on sec-
ondary-structural composition allows easy investi-
gation as to whether there are any deviations from
the common similarity relationships on account of
secondary-structure characteristics. Figure 5(a) 
reveals that secondary structural composition does 
not markedly affect the trends in sequence and 
structure similarities. This is consistent with the 



data given by Wood & Pearson (1999). However, 
the larger average length of α/β domains com-
pared with domains in the other classes results in a
deviation in the length-dependent Sstr (Figure 5(b)).
The consistency among length-independent scores
applies for certain individual folds as well. The
immunoglobulin fold makes up an appreciable
fraction of all the β-pairs (Figure 1(c)), yet the
results are not affected if these pairs are left out. 

Linking sequence and structure to function 

Difficulties of functional comparison 

There is a clear, well-characterized relationship 
between sequence and structure similarity, which 







Figure 4. Probabilistic scores: P-values. Pseq and Pstr are P-values calculated from Sseq and Sstr according to the
formalism given by Levitt & Gerstein (1998). Both quantities have the same overall functional form in terms of an
extreme value distribution:

P = 1 - exp(-exp(-Z))

where P is either Pseq or Pstr. For Pseq, Z = Sseq/a - 2 ln M - b/a, where a = 5.84, b = -26.3, and M is the geometric
mean of the lengths of the two sequences (i.e. M^2 = nm, where n and m are the two sequence lengths). For Pstr, Z is a
function of Sstr and N, the number of matched residues. For N < 120:

Z = (Sstr - c ln^2 N - d ln N - e)/(f ln N + g)

For N > 120:

Z = (Sstr - a ln N - b)/(f ln 120 + g)

At N = 120, continuity implies that:

a ln 120 + b = c ln^2 120 + d ln 120 + e and a = 2c ln 120 + d

This, in turn, allows the calculation of the constants:

a = 171.8, b = -419.4, c = 18.4, d = -4.50, e = 2.64, f = 21.4, g = -37.5

(a) of this Figure is analogous to Figures 3(a) and 2(b), with the exception of the fits. It is a log-log (base 10) plot
relating Pstr and Pseq. We show the median log(Pstr) value for each log(Pseq) bin, along with quartile bars above and
below. We have added approximate percent identity and RMS values to the x and y axes to aid interpretation of the
graph in terms of more familiar scores. The values were calculated using the calibration curves in (b) and (c). The
straight-line nature of the log-log plot reveals distinct relations inside and outside the twilight zone, labeled TZ. (The
area of percent identity below the twilight zone does not appear in Pseq graphs: there is no significance for such low
sequence similarity, and thus all data points in that zone appear at Pseq = 1 or log(Pseq) = 0.) The thick line in the figure is
fit to the median Pstr values for Pseq values outside the twilight zone; its equation is:

P,^ = 10-i"P?^ 

The thin line is fit to the data inside the twilight zone; it follows the relation:

Pstr = 10-^PS^" 

For reference we include the dotted line, representing the function Pstr = Pseq, where sequence and structural simi-
larity are equally significant. See the text for a discussion of how the two trends might be interpreted with respect to
this line.
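
The sequence P-value described in this legend can be sketched in a few lines of Python. The sketch covers only Pseq, using the extreme value form P = 1 - exp(-exp(-Z)) with Z = Sseq/a - 2 ln M - b/a and M^2 = nm. The function names, the overflow guard, and the example score and lengths are assumptions made for illustration.

```python
import math

def extreme_value_p(z):
    """P = 1 - exp(-exp(-Z)), evaluated stably for large positive or negative Z."""
    if z < -700.0:
        return 1.0                      # exp(-z) would overflow; P is 1 to machine precision
    return -math.expm1(-math.exp(-z))   # expm1 keeps precision when P is tiny

def z_seq(s_seq, n, m, a=5.84, b=-26.3):
    """Z for a sequence alignment score; 2 ln M equals ln(n * m) when M^2 = n * m."""
    return s_seq / a - math.log(n * m) - b / a

def p_seq(s_seq, n, m):
    """P-value of score s_seq for two sequences of lengths n and m."""
    return extreme_value_p(z_seq(s_seq, n, m))

# Hypothetical example: the same score is less significant for longer sequences.
print(p_seq(100.0, 100, 100))
print(p_seq(100.0, 300, 300))
```

For large Z the expression reduces to approximately exp(-Z), so log P falls roughly linearly with the score, which is why the log-log presentation in panel (a) is convenient.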



can be used to transfer precisely structural annota- 
tion based on the degree of sequence homology. In 
genome analysis, however, one is usually more 
interested in finding a functional annotation for an 
open reading frame based on similarity to well-
known proteins; yet the sequence-function and
structure-function relationships have not been as
explicitly characterized. The fundamental obstacle
to extending this and similar investigations to deal
with function is the absence of a clear measure of 







[Figure 5 plots; x-axis of panel (a): % identity]



Figure 5. SCOP class differences. Previously it has
been observed that secondary structural composition
does not cause deviations from the trends in structure
and sequence similarity (Flores et al., 1993). To test this
observation we looked at the scores divided by SCOP
class. The following legend applies to the graphs: ( —
■ — ), all alpha; (-♦-), all beta; (--A--), alpha/beta;
(- - X - -), alpha + beta. (a) Median RMS values for
each percent identity bin. The traditional scores reveal
no dependency on class. However, in (b) α/β pairs con-
sistently have higher Sstr scores than pairs in other
classes. This is a consequence of the dependence of Sstr
on length; domains in the α/β class are longer, on aver-
age, than in the other classes.



functional similarity. Although we were able to 
present three different quantitative measures of 
structural relatedness, an analogous situation for 
function does not exist. How can one express
quantitatively the degree of similarity between a 
triosephosphate isomerase and a glucose-6-phos- 
phate isomerase? How do they compare to trp 
repressor? 

The absence of a clear measure of functional 
similarity is not the only obstacle in transferring 
the functional annotations between proteins with 
different degrees of homology. The definition of 
function itself is often vague. More specifically, at 
present there is an absence of such important infor- 
mation as a standardized vocabulary for protein 
functional annotations with an associated number- 
ing scheme, descriptions of monomer functions of 
subunits of multisubunit proteins and hierarchical 
functional assignments for proteins with multiple 



functions. As a consequence of these difficulties 
there is no functional equivalent to the hierarchical 
fold classification for domains in PDB. 

As signs of progress in this direction, several 
functional classifications have been developed to 
date. One is the ENZYME system developed by 
the Enzyme Commission (EC) to classify enzymes
by reaction type (Webb, 1992). This system has the
advantage that it is "universal," applicable to
proteins in many different organisms, and is in 
wide use. However, it also has several drawbacks. 
First of all, it does not consider catalytic reaction 
mechanisms (Riley, 1998a), often ignoring obvious 
similarities. Second, it presumes a 1:1:1 relationship 
between gene, protein and reaction, although this 
is often not the case (an enzyme can have 
two functions, or two polypeptides from two
different genes can oligomerize to perform a single
function). Perhaps the most significant drawback
of the EC classification is that it applies only to
enzymes.

A number of more comprehensive schemes 
have been developed, which classify non-
enzymes as well as enzymes. Most of these
focus on individual organisms. Several such
schemes exist, for instance, GenProtEC/EcoCyc
for E. coli (Karp et al., 1998b; Riley & Labedan,
1996; Riley, 1998b), MIPS for yeast (Mewes et al.,
1998), Ashburner's functional classification for
Drosophila, which is connected to FLYBASE
(Ashburner & Drysdale, 1994), and EGAD for
human ESTs (Adams et al, 1995). These classifi- 
cations possess some advantages. They have 
additional levels of hierarchy that help present a 
more comprehensive picture of genotype-pheno- 
type relationships. On the other hand, these 
classifications still leave much room for improve- 
ment. For example, there is no standardized 
vocabulary to allow for keyword searches 
among multiple databases and across organisms, 
and there are inconsistencies in category num-
bering style. 

Finally, there has been some promising work 
going beyond the ENZYME and organism-focused 
classifications. There has been progress on comple- 
tely automated functional classification (des Jardins 
et al, 1997; Tamames et al, 1997), which has the 
potential for putting function assignments on a 
more objective basis. There are a number of data- 
bases synthesizing the various enzyme functions
into coherent pathways and systems (e.g. KEGG 
and WIT, Ogata et al, 1999; Selkov et al, 1998). 
There also have been some very recent attempts to 
develop cross-species classifications of non-enzyme 
functions in the framework of the Gene Ontology 
Project (GO, geneontology.org). GO is a joint pro-
ject between FlyBase, the Saccharomyces Genome
Database and Mouse Genome Informatics,
attempting to merge the fly, yeast and mouse
functional classification schemes. However, a truly
universal system for classifying all protein func-
tions in all organisms within the same framework
remains quite a challenge because of the 






sheer diversity of organisms and distinct protein 
functions. 



Our simple functional classification of SCOP 
domains: FLY + ENZYME

Given the discussed limitations, we constructed 
a simple functional classification for the SCOP 
domains included in our comparison; our classifi-
cation is based on a merger of two of the existing
functional annotations and a cross-referencing of
subsets of this combination with some of the
organism-specific schemes. First, we used pairwise
comparison to cross-reference the PDB domains
against the Swissprot database (Bairoch &
Apweiler, 1998), as described by Hegyi & Gerstein
(1999). We chose to assign protein functions
according to Swissprot because it provides more
comprehensive functional annotations than SCOP.

We were initially able to divide all entries into 
enzymes and non-enzymes, a division that rep- 
resents the highest level of functional difference in 
our classification scheme (Figure 6). For the 
enzyme category, we transferred EC (Webb, 1992) 
numbers to those SCOP domains with a one-to-one
match to a Swissprot enzyme. Only one-to-one 
matching entries could be considered because 
Swissprot assigns ENZYME numbers to entire pro- 
teins, whereas SCOP is a domain-based classifi- 
cation; therefore we could be confident about the 
classification of only those domains which map to 
an entire Swissprot entry. 

In the absence of an EC-type classification for 
non-enzymes, we assigned functions to non-enzy-
matic SCOP domains according to Ashburner's
original classification of Drosophila protein func-
tions. This classification is derived from a con-
trolled vocabulary of fly terms. It is available on
the web and loosely connected with the FLYBASE
database (Ashburner & Drysdale, 1994). For clarity,
we precisely describe the specific files and version 
(1.55, 1997) of the classification that we used in the 
caption to Figure 6, and we will hereafter refer to 
these data files as constituting the original FLY 
classification. 

The FLY classification is a dynamic object, chan- 
ging as more is learned about the fly and other 
organisms. This is particularly true of late with the 
imminent completion of the Drosophila genome. In 
fact, since the completion of our analysis, the FLY 
classification has been superseded by the new GO
classification (see above). 

The hierarchical structure of the FLY classifi-
cation makes it well suited for classifying non-
enzymatic SCOP entries in a manner comparable
to the ENZYME assignments for the enzymes.
Another advantage of this classification is that it is
more compatible with the makeup of the PDB than 
the E. coli and yeast classifications, as Drosophila is 
a multi-cellular organism, and many of the known 
structures come from animals. We were able to use 
the original FLY classification as a framework to 



which we added functional categories and individ- 
ual proteins. For instance, we added "Hemo- 
globin" to the "Physiological Processes - 
Respiration" category. Another example is the
"Physiological processes - Immunity" category
(Figure 6(b)), to which we added immune system
proteins. Many of the additions would not be
necessary in the context of the new cross-species
GO system. We also modified slightly the number-
ing scheme in the original FLY classification in
order to assign a unique hierarchical number to
each protein domain (Figure 6(b)). We will refer to
our augmented FLY classification as the FLY+
scheme, and our merged scheme as the FLY +
ENZYME classification.

As discussed earlier, the universal functional 
classification of proteins is very challenging and 
may not be possible with the current level of 
knowledge about genes, proteins and genomes. 
Consequently, the FLY + ENZYME classification
of SCOP proteins is somewhat incomplete and 
inconsistent and retains many of the limitations 
of its components (Hegyi & Gerstein, 1999; 
Riley, 1998a). It is not yet broad enough to 
include many plant, virus and bacterial proteins. 
Nevertheless, it was sufficient for our analysis, 
as we were able to classify a very large number 
of the total 30,000 pairs. 



Determining functional similarity 

Using our compound functional classification, 
we were able to assign a level of functional simi- 
larity to each domain pair. According to our 
scheme, a pair can have no functional similarity 
(an enzyme paired with a non-enzyme) or it can 
have one of three levels of similarity: 

(i) General similarity. Both domains are
enzymes or both are non-enzymes.

(ii) Same functional class. Both domains share
the first component of their ENZYME or FLY+
numbers, e.g. 1.1.1.1 alcohol dehydrogenase and
1.3.1.1 cortisone beta-reductase (for enzymes), or
3.3.2.1.2 calcicyclin and 3.6.3.2.1 calmodulin (for
non-enzymes).

(iii) Same precise function. Both domains share
three components of their ENZYME or FLY+
number, e.g. 1.1.1.1 alcohol dehydrogenase and
1.1.1.3 homoserine dehydrogenase (for enzymes)
or 1.2.9.1.1.1 Arc repressor and 1.2.9.1.1.1 C-jun
(for non-enzymes; both are transcription factors).
A pair that shares precise function must also, by
definition, share functional class and general
similarity.
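
The three levels can be expressed as a small comparison of hierarchical numbers. The Python sketch below is an illustration only: the function name and return labels are hypothetical, and it assumes that "three components" means the three leading components of the dotted number, as in the examples given in the list.

```python
def functional_similarity(num_a, num_b, is_enzyme_a, is_enzyme_b):
    """Return 'none', 'general', 'class' or 'precise' for a pair of domains.

    num_a, num_b: dotted ENZYME or FLY+ numbers, e.g. "1.1.1.1"
    is_enzyme_*:  True for enzymes, False for non-enzymes
    """
    if is_enzyme_a != is_enzyme_b:
        return "none"                    # enzyme paired with non-enzyme
    parts_a = num_a.split(".")
    parts_b = num_b.split(".")
    # Assumption: precise function = identical first three components.
    if len(parts_a) >= 3 and len(parts_b) >= 3 and parts_a[:3] == parts_b[:3]:
        return "precise"
    if parts_a[0] == parts_b[0]:
        return "class"
    return "general"

# Examples taken from the list above.
print(functional_similarity("1.1.1.1", "1.3.1.1", True, True))        # class
print(functional_similarity("1.1.1.1", "1.1.1.3", True, True))        # precise
print(functional_similarity("3.3.2.1.2", "3.6.3.2.1", False, False))  # class
```

Returning only the highest level reached is sufficient, because a pair at the precise level also counts at the class and general levels by definition.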

Based on those assignments we calculated the 
percentage of total pairs at a given level of 
sequence or structural similarity possessing each 
level of functional similarity. The results appear in 
Figure 7. 
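
The percentages can be tabulated by binning pairs on a similarity measure and counting, within each bin, the pairs that reach each level. The sketch below uses percent identity as the measure; the function name, the 10% bin width, the level labels, and the example pairs are all hypothetical and are not data from this study.

```python
from collections import defaultdict

LEVEL_RANK = {"none": 0, "general": 1, "class": 2, "precise": 3}

def conservation_by_bin(pairs, bin_width=10):
    """Percentage of pairs in each percent-identity bin reaching each level.

    pairs: iterable of (percent_identity, level) with level a key of LEVEL_RANK.
    Levels are nested, so a 'precise' pair also counts as 'class' and 'general'.
    """
    counts = defaultdict(lambda: [0, 0, 0, 0])   # total, general, class, precise
    for percent_identity, level in pairs:
        start = int(percent_identity // bin_width) * bin_width
        start = min(start, 100 - bin_width)       # put 100% in the top bin
        row = counts[start]
        row[0] += 1
        for rank in (1, 2, 3):
            if LEVEL_RANK[level] >= rank:
                row[rank] += 1
    return {
        start: {
            "general": 100.0 * g / total,
            "class": 100.0 * c / total,
            "precise": 100.0 * p / total,
        }
        for start, (total, g, c, p) in sorted(counts.items())
    }

example_pairs = [
    (95, "precise"), (92, "precise"),
    (35, "precise"), (33, "class"), (38, "general"),
    (7, "none"), (8, "general"),
]
table = conservation_by_bin(example_pairs)
for start, row in table.items():
    print(start, row)
```

With few pairs per bin the percentages become noisy, so a coarser `bin_width` may be preferable for small data sets.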






Sequence and function 

The relation between sequence similarity and 
functional similarity behaves as one might expect, 
with sigmoidal curves that drop off sharply at par- 
ticular conservation thresholds, and with the three 
levels of functional similarity (precise function, 
functional class and general similarity) having pro- 
gressively lower thresholds. Figure 7(a) shows that 
precise function is not conserved below 30-40% 
sequence identity, whereas functional class is con- 
served for sequence identities as low as 20-25%. 
Below 20%, general similarity is no longer con- 
served; among pairs of approximately 7% 
sequence identity, about 40% are enzymes paired 
with non-enzymes. It is important to note that in 
all the pairs considered here, the domains share 
the same fold. Functional similarity at low percent 
identities (e.g. 7%) would be much less for all 
possible pairs of domains rather than just for those 
with the same fold. It is also important to remem- 
ber that our thresholds for functional conservation 
are statistical averages over many sequences; one 
will, of course, be able to find individual cases that
diverge more or less rapidly. 

There are differences between the functional con- 
servation thresholds of enzymes and non-enzymes, 
with enzymes appearing to more highly conserve 
precise function than non-enzymes, but non-
enzymes conserving functional class more highly
than enzymes. This may reflect that in our classifi-
cation, the non-enzyme functional classes are
broader and hence easier to conserve than those of
the enzymes, while the non-enzymatic precise
functions are more specific. 

When Pseq is used as the measure of sequence
similarity (Figure 7(b)) the results look somewhat
different: it appears that functional class is con-
served for the entire range of sequence similarities.
In this case, percent identity is actually more discri-
minating than Pseq because functional class
diverges only at sequence similarities that are low
enough that they have little or no statistical signifi-
cance, i.e. for Pseq the divergence is compressed
near the vertical axis of the graph. 



Structure and function 

The relation between similarity in structure and 
function is somewhat less straightforward than 
that between similarity in sequence and function. 
Figure 7(c) shows the relationship between RMS 
and functional similarity. Broadly, it appears simi- 
lar to that for percent identity and functional simi- 
larity; however, the thresholds for conservation of 
the various types of functional similarity are less 
sharp. 

RMS is more revealing with respect to functional 
similarity than the non-traditional structural scores, 
Sstr and Pstr. (Data for Sstr and Pstr are not shown
but are available from the website.) The reason is
that, while very structurally similar pairs all have
RMS scores clustered between 0 and 0.5 Å, Sstr has
a large range of scores for similar pairs due to the
length dependency, and Pstr does not have any
limit for maximum similarity. The wide range of
possible Sstr and Pstr scores for similar structures
tends to blur the broad sigmoid curves so much so 
that they are no longer apparent. 

Alternative functional classifications: MIPS
and GenProtEC 

To get some perspective on the degree to which 
our results reflected the particularities of our com-
bined FLY + ENZYME classification, we decided
to try the same comparisons based on the well-
known functional classifications for yeast and
E. coli, MIPS and GenProtEC (Mewes et al., 1998;
Riley & Labedan, 1996; Riley, 1998b). These classi-
fications have the advantage that they integrate
enzyme and non-enzyme functions from the start
and are widely used. However, as they are only 
applicable to individual organisms, we could only 
use them to classify a considerably smaller subset 
of the known structures than the compound FLY + 
ENZYME system. 

The specific way we used the MIPS and Gen- 
ProtEC classifications to assign function to struc- 
tures and to calculate functional similarities is 
described in the legend to Figure 7. Our results 
in terms of functional conservation (precise and 
class) at various levels of percent identity are
shown in Figure 7(d). We observe the same gen-
eral relationships as we did for our FLY +
ENZYME scheme. That is, the functional
conservation curves have a sigmoidal shape and 
have cut-offs for precise functional similarity 
after 40% and for functional class similarity at 
lower values. However, because the MIPS and 
GenProtEC classifications are restricted to indi- 
vidual organisms, each curve represents con- 
siderably fewer data points than do the curves 
based on the FLY + ENZYME scheme; this
required us to "bin" the MIPS and GenProtEC 
curves in a somewhat coarser fashion. 



Discussion and Conclusion 

Here, we assessed the transfer of functional and 
structural annotation by analyzing the relation- 
ships between similarity in sequence, structure and 
function. The ~30,000 protein domain pairs of 
varying levels of similarity (at least the same fold) 
that we constructed out of the SCOP classification 
show quantitative sequence-structure relationships 
consistent with previous research. The exponential 
relationship is consistent across the secondary- 
structural classes and holds for newer probabilistic 
scoring methods. 

The sequence-function and structure-function
relationships have not been studied as precisely 
due to the lack of a robust functional classification 
and measure of functional similarity. To overcome 






Assessing Annotation Transfer for Genomics





Key: precise functional similarity; general similarity; functional class similarity.

Figure 6. Functional classification of enzymes and non-enzymes. (a) Divides the pairs by general function. There
are three categories of pairs: (i) enzymes paired with non-enzymes (no general functional similarity), labeled ENZ/ENZ;
(ii) enzymes paired with enzymes (same general function), labeled ENZ/ENZ; and (iii) non-enzymes paired
with non-enzymes (same general function). Pairs for which one or both domains could not be identified as enzyme
or non-enzyme are not included in this chart. Enzymes are classified according to the EC system (Webb, 1992). The
first component of the number represents the nature of the reaction and is called the class. There are six classes: oxidoreductases,
transferases, hydrolases, lyases, isomerases and ligases. The next level is the subclass. It refers to the chemical
groups on which the enzyme acts. For example, the first class, oxidoreductases, has 19 subclasses that are arranged
according to the donor group that undergoes oxidation (CH-OH, aldehyde or oxo group, CH-CH group, etc.). For
another group of enzymes (hydrolases) the subclass is determined by the nature of the bond: ester bond, peptide bond,
etc. The next level is the sub-subclass. For oxidoreductases this indicates the acceptor group: NAD(+) and NADP(+), or
cytochrome; for hydrolases the sub-subclass represents the nature of the substrate (carboxylic ester hydrolases, thiolester
hydrolases, etc.). The fourth level represents a unique number for each individual enzyme, for example, 1.1.1.1: alcohol
dehydrogenase. (b) Shows how we adapted the functional classification of Drosophila gene products developed
by M. Ashburner. This classification is loosely connected with FLYBASE (Ashburner & Drysdale, 1994). We used version
1.5S (4 August 1997) that was available from Ashburner's website:

http://www.ebi.ac.uk/~ashbum

The specific files that we used were taken from the ftp directory: 



ftp.ebi.ac.uk/databases/edgp/misc/ashbumer 






this we constructed our own classification by merging
and extending the ENZYME and FLY
schemes and assigning levels of functional similarity.
Our measures of functional similarity provide
curves relating function to sequence and
structure; when relating functional conservation to
sequence divergence, we find distinct thresholds at
~40% for precise function and ~25% for functional
class.

One of the interesting results that emerges from
this is that percent identity is more useful for quantifying
functional divergence than the newer probabilistic
scores. In general, modern probabilistic
scores, such as P_seq, are better at discriminating
amongst highly diverged sequences (near the twilight
zone) than percent identity, since they better
take into account gaps and conservative substitutions
(of similar amino acids). However, for very
similar pairs of sequences, percent identity is a
simpler and more direct measure of divergence
(essentially a Hamming distance). Since divergence
in precise function takes place before that in structure
(well before the twilight zone), it is quite
reasonable that percent identity is more successful
at measuring the former than the latter and that
the converse is true for the probabilistic scores. In
other words, percent identity is better calibrated
for discriminating amongst very close, significant
relationships and P_seq for more distant ones.
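The description of percent identity as "essentially a Hamming distance" can be made concrete with a minimal Python sketch. It assumes the two sequences are already aligned to equal length with "-" as the gap character, and it counts identities only over columns where neither sequence has a gap; the text does not state which denominator convention was used, so that choice and the function name percent_identity are illustrative assumptions.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned, ungapped columns.

    Assumption (not from the paper): columns containing a gap in either
    sequence are skipped, so gaps do not count as mismatches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0  # columns where both sequences have a residue
    matches = 0   # columns where the residues are identical
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared
```

Because every mismatch counts equally, a conservative substitution lowers the score as much as a radical one, which is the insensitivity that the probabilistic scores are said to address.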



Practical implications 

The sequence-structure and sequence-function 

relationships described here provide practical 
information for genome annotation in terms of 
folds and functions. Table 1 summarizes the rela- 
tive advantages of the different scoring methods 
we used. Using the trends in sequence and struc- 
ture similarity, one can assess the degree to which 
structural annotation can be transferred between 
sequences at a given level of sequence similarity. 
The sequence and function similarity thresholds 
potentially establish minimum requirements of 
sequence similarity for reliable function prediction. 
Note that because the protein domain pairs con- 
sidered here all share the same fold, the numbers 
for all possible pairs will differ in the region of 
very little sequence identity, in which the sequence
similarity is not enough to indicate the same fold. 



We refer to these as constituting the original FLY classification. Recently, the FLY classification has been superseded
by the GO (Gene Ontology) Project classification, which merges fly, mouse and yeast annotation. Files related to the
GO classification are available from www.geneontology.org. In the original FLY classification all members of the highest
level are labeled 0, representatives of the next level are labeled 1, and all lower levels are labeled 2 through to 9.
We changed the numbering scheme so that it reflects the hierarchical nature of the classification. This
Figure illustrates sections of the original and modified classification. The top level in the FLY classification scheme is
called "Function primitive" (level 0) and includes five classes: "Metabolism," "Intracellular protein traffic," "Cell
structure," "Developmental process," "Physiological process," and "Behavior." The next level after "Function primitive"
is "Process" or "Molecule" (level 1 in Ashburner's classification). For "Function primitive - Metabolism" the
processes are "Carbohydrate metabolism," "Nucleotides and nucleic acids metabolism," etc. For "Function primitive
- Cell Structure" the "Process" can be "Nucleus," "Mitochondrion," "Membrane," etc. The next level is "Pathway"
or "Macromolecule" (level 2 in the original classification). "Pathway" can include "Metabolic pathway," "Signaling
pathway," or "Developmental pathway." The "Macromolecule" category includes "Protein" and "Nucleic Acid". We
added categories to the original classification in order to classify some mammalian proteins that are widely represented
in SCOP but are absent from the original FLY scheme. These categories include immune system proteins
(labeled "new" in (b)) and respiratory proteins such as hemoglobin and myoglobin that we added to "Function primitive
- Physiological process - Respiration". We call our adaptation of the original FLY scheme FLY+. Further information
on this adaptation is available at

http://bioinfo.mbb.yale.edu/align/func

(c) The overall hierarchy of our final scheme and identification of the different levels of similarity. If two proteins are
both enzymes or both non-enzymes, then they possess general functional similarity. If they share the first component
of their classification numbers, then they are in the same functional class. If they share the first three components of
their enzyme numbers (or the equivalent for non-enzyme numbers, depending on category) then they have the same
precise function. A significant difference between the two main branches of the hierarchy is that the levels of the
ENZYME classification do not correspond exactly to those in the FLY+ system because the fly classification is more
extensive than the enzyme classification. For instance, the FLY classification takes into account aspects of cellular
(cytoskeleton, metabolic pathways, etc.) and phenotypic function (morphology, physiology, behavior) that are absent
from the ENZYME scheme. This makes our classification of SCOP proteins somewhat unbalanced, as non-enzymes
have much broader and more loosely defined functional classes. As a consequence, while each enzyme is assigned a
four-component number, the length of a non-enzyme number varies, depending on the functional category to which
it belongs. For example, myosin is assigned a number that happens to have the same length as EC numbers: 3.12.1.1.
However, transcription factors are numbered 1.12.9.1.1.1. We took into account this varying hierarchy depth in deciding
how many components are necessary to identify precise function in each category. Note that what we mean by
domains having the same precise function is not the same as the domains coming from the same essential protein.
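The three levels of similarity described in (c) can be sketched as a comparison of hierarchical classification numbers. The Python below is only an illustration under stated assumptions: the FunctionLabel type, its field names, and the per-label precise_depth parameter are hypothetical, and the legend does not list the category-specific depths used for non-enzymes, so the default of three components is taken from the enzyme rule alone.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FunctionLabel:
    """Hypothetical container for one domain's functional assignment."""
    is_enzyme: bool
    number: Tuple[int, ...]   # e.g. (1, 1, 1, 1) for EC 1.1.1.1
    precise_depth: int = 3    # components that must match for precise function (assumed)


def similarity_level(a: FunctionLabel, b: FunctionLabel) -> str:
    """Return the most specific level of functional similarity for a pair."""
    # Enzyme paired with non-enzyme: no general functional similarity.
    if a.is_enzyme != b.is_enzyme:
        return "none"
    # Same branch but different first component: general similarity only.
    if not a.number or not b.number or a.number[0] != b.number[0]:
        return "general"
    # Precise function needs agreement down to the category-specific depth.
    depth = max(a.precise_depth, b.precise_depth)
    if (len(a.number) >= depth and len(b.number) >= depth
            and a.number[:depth] == b.number[:depth]):
        return "precise"
    return "class"
```

For example, two oxidoreductases numbered 1.1.1.1 and 1.1.1.2 share their first three components and so come out as "precise", while 1.1.1.1 and 1.2.1.1 share only the class.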








Figure 7. Linking sequence, structure and function. We express functional similarity as the fractional percentage of
pairs at a given level of sequence/structural similarity for which the paired domains share a precise function, functional
class, or general similarity (according to our classification, see Figure 6). The following legend applies to (a)
through (c): (-O-), general similarity; (-x-), non-enzymes with same functional class; (-A-), enzymes
with same functional class; (---x---), non-enzymes with same precise function; and (---A---), enzymes with the
same precise function. (a) Relates functional similarity to sequence similarity in terms of percent identity. The functional
similarity appears as a sharp sigmoid, with distinct thresholds of divergence for precise function, functional
class, and general similarity. Enzymes are paired with non-enzymes only at very low percent identity, in and below
the twilight zone (labeled TZ). At slightly higher sequence identity, pairs diverge with respect to functional class, and
beyond 40% identity with respect to precise function. Note that 50-100% identity is not shown because almost all
domains that are that similar share function with their counterparts. (b) Shows the same data using P_seq as the
measure of sequence similarity. Only the divergence in precise function is visible: because there is so little significance
at the low sequence similarity at which functional class and general similarity diverge, all data points in that
region appear near P_seq = 1 or log(P_seq) = 0 (the y-axis). (c) Illustrates that the structure-function relation is not as
clearly defined as that for sequence and function. Functional similarity expressed in terms of RMS separation appears
as a broad sigmoid curve; there are thresholds of divergence for precise function, but the divergences in functional
class and general similarity are more gradual. The thresholds are apparent only because RMS clusters the most structurally
similar pairs between scores of 0 and 0.5 Å. For this reason, RMS is better at discerning functional similarity
than S_str and P_str, which do not cluster the most similar pairs around a set limit. (d) Shows the same relationships
(functional conservation versus percent identity) as in (a), except that for this graph functional similarity is determined
in terms of the MIPS (Mewes et al., 1998) and GenProtEC (Riley, 1998b) classifications rather than the
FLY + ENZYME scheme. The legend appears as the inset on the graph. We assigned MIPS and GenProtEC classifications
to SCOP domains based on sequence comparisons to classified yeast and E. coli open reading frames (ORFs), respectively.
The SCOP domain most closely matching each ORF classified in MIPS or GenProtEC was assigned the corresponding
MIPS or GenProtEC function number. Only matches of 80% sequence identity or greater were considered.
We used this SCOP domain as a functional representative; when determining functional similarity, we assigned to
SCOP domains with no MIPS or GenProtEC functional designation the function of the closest representative with at
least 85% sequence identity, if one existed. GenProtEC functional identifiers are three-component numbers. We consider
a pair of domains sharing the first component of their functional designation to be in the same functional class.
Domains that share all three components are said to have the same precise function. For MIPS the functional designation
is not as straightforward, as one ORF can be assigned multiple functions. Therefore we consider domains
which have at least one function in common to share functional class. Domains with all functions in common, the
same combination of identifiers, share precise function. Because MIPS and GenProtEC each classify the proteins of a
single organism, yeast and E. coli, respectively, these classifications can determine the functional similarities of only a
small fraction of all our SCOP domain pairs. The data based on these classifications, appearing in (d), are therefore
very sparse compared to the data in (a)-(c). Despite the coarseness of the data, functional similarity based on the
MIPS and GenProtEC classifications follows the same general relation to sequence similarity as does functional similarity
based on the more comprehensive FLY + ENZYME scheme. Vertical line indicates an approximate threshold of
functional divergence at 40% identity.
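The quantity plotted in these panels, the fraction of pairs at a given level of sequence similarity that share a function, can be sketched in Python as a simple binning step. This is an illustration rather than the authors' procedure: the function name conservation_curve, the fixed bin width, and the input format (one tuple of percent identity and a boolean per domain pair) are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def conservation_curve(
    pairs: Iterable[Tuple[float, bool]], bin_width: float = 10.0
) -> Dict[float, float]:
    """Fraction of pairs sharing a function, per percent-identity bin.

    pairs: (percent_identity, shares_function) for each domain pair.
    Returns a mapping from the lower edge of each non-empty bin to the
    fraction of pairs in that bin whose function is shared.
    """
    totals = defaultdict(int)
    shared = defaultdict(int)
    for percent_identity, shares_function in pairs:
        # Lower edge of the bin this pair falls into, e.g. 35.0 -> 30.0.
        lower_edge = int(percent_identity // bin_width) * bin_width
        totals[lower_edge] += 1
        if shares_function:
            shared[lower_edge] += 1
    return {edge: shared[edge] / totals[edge] for edge in sorted(totals)}
```

A wider bin_width corresponds to the coarser binning that sparse data, such as the MIPS and GenProtEC curves in (d), would call for.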






Table 1. Summary of scoring methods

Traditional scores
  Sequence similarity:   Percent sequence identity
  Structural similarity: RMS Cα separation
  Features:              Well understood, in use; percent identity better for
                         looking at functional similarity
  Limitations:           RMS depends most highly on worst matches, requiring
                         arbitrary trimming; percent identity is insensitive to gaps
                         and conservative substitutions

Alignment similarity
  Sequence similarity:   S_seq
  Structural similarity: S_str
  Features:              Analogous similarity scores; depends most highly
                         on best matches
  Limitations:           Dependence on alignment length

Modern probabilistic
  Sequence similarity:   P_seq
  Structural similarity: P_str
  Features:              Statistical significance; unified framework for
                         different comparisons
  Limitations:           Not as familiar as RMS and percent identity

The Table lists the schemes presented here for characterizing the sequence-structure relationship, along with their relative
advantages and disadvantages.



Practically, then, when one searches an uncharacterized
open reading frame against known structures,
if the open reading frame matches a
structure with a good e-value or percent identity,
then the curves presented here can be used to
check how the functional and detailed structure
annotation will transfer. For example, if an
unknown open reading frame matches a PDB
structure with an e-value of 0.001 and a percent
identity of 30%, then one can be assured that it
has the same fold (Brenner et al., 1998) and according
to our analysis it has a two-thirds chance of
having the same exact function. Furthermore, it
has a ~99% chance of having the same functional
class and its structure probably diverges from the
known structure by a trimmed RMS of less than
0.7 Å.
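As a rough illustration of how the reported thresholds might be applied, the Python sketch below turns the ~40% (precise function) and ~25% (functional class) cut-offs into a step-function lookup. This is a deliberate simplification and an assumption of this example: the curves are sigmoidal, so a match at 30% identity still has roughly a two-thirds chance of sharing precise function even though the lookup reports only functional class. The function and constant names are hypothetical.

```python
# Approximate thresholds reported in the text (percent sequence identity).
PRECISE_FUNCTION_THRESHOLD = 40.0
FUNCTIONAL_CLASS_THRESHOLD = 25.0


def transferable_annotation(percent_identity: float) -> str:
    """Most specific functional annotation expected to transfer reliably.

    A coarse step-function reading of the sigmoidal conservation curves;
    it ignores the gradual transition around each threshold.
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identity >= PRECISE_FUNCTION_THRESHOLD:
        return "precise function"
    if percent_identity >= FUNCTIONAL_CLASS_THRESHOLD:
        return "functional class"
    return "no reliable functional transfer"
```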



Future directions 

There are a number of directions in which we
might extend this analysis. With respect to the
sequence-structure relation, we can reduce the
overrepresentation of the immunoglobulins and
improve the calculation of the probabilistic scores (by redoing the fit
to the extreme value distribution reported by
Levitt & Gerstein (1998)) to eliminate residual
length-dependency.

In the functional realm, we can investigate if and 
how the sequence-function and structure-function
relationships vary for different categories of pro- 
teins. For example, although we found consistency 
of the sequence-structure relationship among sec- 
ondary structural classes, Hegyi & Gerstein (1999) 
found that the distribution of enzymes and non- 
enzymes varies with secondary structural class. 
A related issue is that of conformational changes. 
It is conceivable that among domains with very 
similar sequences but structures that differ by a 
conformational change, function is less conserved 
than it is among similar sequences with more simi- 
lar structures. 

Perhaps the most important direction in which
to further this work is the augmentation of the
functional classification. With the growing
number of fully sequenced genomes there is a
need for the development of a comprehensive
system for functionally classifying proteins, a
complete classification for the entire universe of
protein functions. It will be a difficult process,
as many existing organism-specific classifications
will have to be merged, but the end result will
have the advantage of not being biased towards
any one organism. Such a universal classification
will allow much more reliable transfer of functional
annotation.



Acknowledgments 

We thank A. Lesk for helpful conversations and supplying
us with reference data for Figure 2, S. Brenner for
providing carefully curated SCOP domain sequences,
and H. Hegyi, W. Krebs and V. Alexandrov for assistance
with the sequence comparisons, development of the
FLY + ENZYME scheme, and design of the web database.
M.G. thanks the Keck and Donaghue foundations
for financial support.



References 

Abagyan, R. A. & Batalov, S. (1997). Do aligned
sequences share the same fold? J. Mol. Biol. 273,
355-368.

Adams, M. D., Kerlavage, A. R., Fleischmann, R. D.,
Fuldner, R. A., Bult, C. J., Lee, N. H., Kirkness, E. F.,
Weinstock, K. G., Gocayne, J. D., White, O., Venter,
J. C., et al. (1995). Initial assessment of human gene
diversity and expression patterns based upon 83
million nucleotides of cDNA sequence. Nature, 377,
3-174.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. &
Lipman, D. J. (1990). Basic local alignment search
tool. J. Mol. Biol. 215, 403-410.

Altschul, S. F., Boguski, M. S., Gish, W. & Wootton, J. C.
(1994). Issues in searching molecular sequence databases.
Nature Genet. 6, 119-129.

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J.,
Zhang, Z., Miller, W. & Lipman, D. J. (1997).
Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs. Nucl. Acids
Res. 25, 3389-3402.

Andrade, M. A. & Sander, C. (1997). Bioinformatics: 
from genome data to biological knowledge. Curr. 
Opin. Biotech. 8, 675-683. 

Ashburner, M. & Drysdale, R. (1994). Flybase: the Drosophila
genetic database. Development, 120, 2077-2079.

Attwood, T. K., Flower, D. R., Lewis, A. P., Mabey, J. E.,
Morgan, S. R., Scordis, P., Selley, J. N. & Wright,
W. (1999). PRINTS prepares for the new millennium.
Nucl. Acids Res. 27, 220-225.

Bairoch, A. & Apweiler, R. (1998). The SWISS-PROT
protein sequence data bank and its supplement
TrEMBL in 1998. Nucl. Acids Res. 26, 38-42.

Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer,
E. F., Jr, Brice, M. D., Rodgers, J. R., Kennard, O.,
Shimanouchi, T. & Tasumi, M. (1977). The protein
data bank: a computer-based archival file for
macromolecular structures. J. Mol. Biol. 112, 535-542.

Bork, P. & Koonin, E. V. (1996). Protein sequence motifs.
Curr. Opin. Struct. Biol. 6, 366-376.

Bork, P., Ouzounis, C. & Sander, C. (1994). From genome
sequences to protein function. Curr. Opin.
Struct. Biol. 4, 393-403.

Bork, P., Dandekar, T., Diaz-Lazcoz, Y., Eisenhaber, F.,
Huynen, M. & Yuan, Y. (1998). Predicting function:
from genes to genomes and back. J. Mol. Biol. 283,
707-725.

Brenner, S. E. (1999). Errors in genome annotation.
Trends Genet. 15, 132-133.

Brenner, S. E., Chothia, C., Hubbard, T. J. & Murzin,
A. G. (1996). Understanding protein structure: using
SCOP for fold interpretation. Methods Enzymol. 266,
635-643.

Brenner, S. E., Chothia, C. & Hubbard, T. J. (1998).
Assessing sequence comparison methods with
reliable structurally identified distant evolutionary
relationships. Proc. Natl Acad. Sci. USA, 95, 6073-6078.

Bryant, S. H. & Altschul, S. F. (1995). Statistics of
sequence-structure threading. Curr. Opin. Struct.
Biol. 5, 236-244.

Chothia, C. & Lesk, A. M. (1986). The relation between
the divergence of sequence and structure in proteins.
EMBO J. 5, 823-826.

Chothia, C. & Lesk, A. M. (1987). The evolution of protein
structures. Cold Spring Harbor Symp. Quant.
Biol. 52, 399-405.

des Jardins, M., Karp, P. D., Krummenacker, M., Lee,
T. J. & Ouzounis, C. A. (1997). Prediction of enzyme
classification from protein sequence without the use
of sequence similarity. ISMB, 5, 92-99.

Doolittle, R. F. (1987). Of Urfs and Orfs, University
Science Books, Mill Valley, CA, USA.

Enright, A. J., Iliopoulos, I., Kyrpides, N. C. &
Ouzounis, C. A. (1999). Protein interaction maps for
complete genomes based on gene fusion events.
Nature, 402, 86-90.

Fetrow, J. S. & Skolnick, J. (1998). Method for prediction
of protein function from sequence using the
sequence-to-structure-to-function paradigm with
application to glutaredoxins/thioredoxins and T1
ribonucleases. J. Mol. Biol. 281, 949-968.

Fetrow, J. S., Godzik, A. & Skolnick, J. (1998). Functional
analysis of the Escherichia coli genome using the
sequence-to-structure-to-function paradigm: identification
of proteins exhibiting the glutaredoxin/thioredoxin
disulfide oxidoreductase activity. J. Mol. Biol.
282, 703-711.



Flores, T. P., Orengo, C. A., Moss, D. S. & Thornton,
J. M. (1993). Comparison of conformational characteristics
in structurally similar domain pairs. Protein
Sci. 2, 1811-1826.

Fraser, C. M., Gocayne, J. D., White, O., Adams, M. D.,
Clayton, R. A., Fleischmann, R. D., Bult, C. J.,
Kerlavage, A. R., Sutton, G., Kelley, J. M., Venter,
J. C., et al. (1995). The minimal gene complement of
Mycoplasma genitalium. Science, 270, 397-403.

Fraser, C. M., Norris, S. J., Weinstock, G. M., White, O.,
Sutton, G. G., Dodson, R., Gwinn, M., Hickey, E. K.,
Clayton, R., Ketchum, K. A., Sodergren, E.,
Hardham, J. M., McLeod, M. P., Salzberg, S., et al.
(1998). Complete genome sequence of Treponema
pallidum, the syphilis spirochete. Science, 281, 375-388.

Gerstein, M. (1997). A structural census of genomes:
comparing bacterial, eukaryotic, and archaeal genomes
in terms of protein structure. J. Mol. Biol. 274,
562-576.

Gerstein, M. (1998a). Measurement of the effectiveness
of transitive sequence comparison, through a third
'intermediate' sequence. Bioinformatics, 14, 707-714.

Gerstein, M. (1998b). Patterns of protein-fold usage in
eight microbial genomes: a comprehensive structural
census. Proteins: Struct. Funct. Genet. 33, 518-534.

Gerstein, M. (1998c). How representative are the known
structures of the proteins in a complete genome? A
comprehensive structural census. Folding Des. 3,
497-512.

Gerstein, M. & Altman, R. (1995). Average core structures
and variability measures for protein families:
application to the immunoglobulins. J. Mol. Biol.
251, 161-175.

Gerstein, M. & Hegyi, H. (1998). Comparing microbial
genomes in terms of protein structure: surveys of
a finite parts list. FEMS Microbiol. Rev. 22, 277-304.

Gerstein, M. & Levitt, M. (1996). Using iterative
dynamic programming to obtain accurate pairwise
and multiple alignments of protein structures.
ISMB, 4, 59-67.

Gerstein, M. & Levitt, M. (1998). Comprehensive assessment
of automatic structural alignment against a
manual standard, the SCOP classification of proteins.
Protein Sci. 7, 445-456.

Hegyi, H. & Gerstein, M. (1999). The relationship
between protein structure and function: a comprehensive
survey with application to the yeast genome.
J. Mol. Biol. 288, 147-164.

Henikoff, S. & Henikoff, J. G. (1992). Amino acid substitution
matrices from protein blocks. Proc. Natl
Acad. Sci. USA, 89, 10915-10919.

Hubbard, T. J. P., Murzin, A. G., Brenner, S. E. &
Chothia, C. (1997). SCOP: a structural classification
of proteins database. Nucl. Acids Res. 25, 236-239.

Karlin, S. & Altschul, S. F. (1990). Methods for assessing
the statistical significance of molecular sequence features
by using general scoring schemes. Proc. Natl
Acad. Sci. USA, 87, 2264-2268.

Karlin, S. & Altschul, S. F. (1993). Applications and statistics
for multiple high-scoring segments in molecular
sequences. Proc. Natl Acad. Sci. USA, 90, 5873-5877.

Karlin, S., Bucher, P., Brendel, V. & Altschul, S. F.
(1991). Statistical methods and insights for protein
and DNA sequences. Annu. Rev. Biophys. Biophys.
Chem. 20, 175-203.






Karp, P. D. (1996). A protocol for maintaining multidatabase
referential integrity. Pac. Symp. Biocomput.
438-445.

Karp, P. (1998a). What we do not know about sequence
analysis and sequence databases. Bioinformatics, 14,
753-754.

Karp, P. D., Riley, M., Paley, S. M., Pellegrini-Toole, A.
& Krummenacker, M. (1998b). EcoCyc: encyclopedia
of Escherichia coli genes and metabolism. Nucl.
Acids Res. 26, 50-53.

Karp, P. D., Ouzounis, C. & Paley, S. M. (1996b). HinCyc:
a knowledge base of the complete genome and
metabolic pathways of H. influenzae. ISMB, 4, 116-124.

Lesk, A. M. & Chothia, C. (1984). Mechanisms of
domain closure in proteins. J. Mol. Biol. 174, 175-191.

Levitt, M. & Gerstein, M. (1998). A unified statistical framework
for sequence comparison and structure
comparison. Proc. Natl Acad. Sci. USA, 95, 5913-5920.

Mewes, H. W., Hani, J., Pfeiffer, F. & Frishman, D.
(1998). MIPS: a database for protein sequences and
complete genomes. Nucl. Acids Res. 26, 33-37.

Moult, J., Hubbard, T., Fidelis, K. & Pedersen, J. T.
(1997). Critical assessment of methods of protein
structure prediction (CASP): round II. Proteins:
Struct. Funct. Genet. 1, 2-6.

Murzin, A., Brenner, S. E., Hubbard, T. & Chothia, C.
(1995). SCOP: a structural classification of proteins
for the investigation of sequences and structures.
J. Mol. Biol. 247, 536-540.

Myers, E. & Miller, W. (1988). Optimal alignments in
linear space. Comput. Appl. Biosci. 4, 11-17.

Needleman, S. B. & Wunsch, C. D. (1971). A general
method applicable to the search for similarities in
the amino acid sequence of two proteins. J. Mol.
Biol. 48, 443-453.

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H. &
Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of
genes and genomes. Nucl. Acids Res. 27, 29-34.

Park, J., Teichmann, S. A., Hubbard, T. & Chothia, C.
(1997). Intermediate sequences increase the detection
of homology between sequences. J. Mol. Biol.
273, 349-354.

Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler,
D., Hubbard, T. & Chothia, C. (1998). Sequence
comparisons using multiple sequences detect three
times as many remote homologues as pairwise
methods. J. Mol. Biol. 284, 1201-1210.

Pearson, W. R. (1996). Effective protein sequence comparison.
Methods Enzymol. 266, 227-259.

Pearson, W. R. (1998). Empirical statistical estimates for
sequence similarity searches. J. Mol. Biol. 276, 71-84.

Pearson, W. R. & Lipman, D. J. (1988). Improved tools
for biological sequence comparison. Proc. Natl Acad.
Sci. USA, 85, 2444-2448.



Riley, M. (1998a). Systems for categorizing functions of
gene products. Curr. Opin. Struct. Biol. 8, 388-392.

Riley, M. (1998b). Genes and proteins of Escherichia coli
K-12. Nucl. Acids Res. 26, 54.

Riley, M. & Labedan, B. (1996). E. coli gene products:
physiological functions and common ancestries. In
Escherichia coli and Salmonella: Cellular and Molecular
Biology (Neidhardt, F., Curtiss, R., III, Lin,
E. C. C., Ingraham, J., Low, K. B., Magasanik, B.,
Reznikoff, W., Riley, M., Schaechter, M. &
Umbarger, H. E., eds), 2nd edit., pp. 2118-2202,
ASM Press, Washington, DC.

Rost, B. (1999). Twilight zone of protein sequence alignments.
Protein Eng. 12, 85-94.

Russell, R. B. & Barton, G. J. (1994). Structural features
can be unconserved in proteins with similar folds.
J. Mol. Biol. 244, 332-350.

Russell, R. B., Saqi, M. A. S., Sayle, R. A., Bates, P. A. &
Sternberg, M. J. E. (1997). Recognition of analogous
and homologous protein folds: analysis of sequence
and structure conservation. J. Mol. Biol. 269, 423-439.

Russell, R. B., Sasieni, P. D. & Sternberg, M. J. E. (1998).
Supersites within superfolds - binding site similarity
in the absence of homology. J. Mol. Biol. 282, 903-918.

Salamov, A. A., Suwa, M., Orengo, C. A. & Swindells,
M. B. (1999). Combining sensitive database searches
with multiple intermediates to detect distant homologues.
Protein Eng. 12, 95-100.

Selkov, E., Jr, Grechkin, Y., Mikhailova, N. & Selkov, E.
(1998). MPW: the metabolic pathways database.
Nucl. Acids Res. 26, 43-45.

Smith, T. F. & Waterman, M. S. (1981). Identification of
common molecular subsequences. J. Mol. Biol. 147,
195-198.

Sternberg, M. J. E., Bates, P. A., Kelley, L. A. &
MacCallum, R. M. (1999). Progress in protein structure
prediction: assessment of CASP3. Curr. Opin.
Struct. Biol. 9, 368-373.

Tamames, J., Casari, G., Ouzounis, C. & Valencia, A.
(1997). Conserved clusters of functionally related
genes in two bacterial genomes. J. Mol. Evol. 44, 66-73.

Tatusov, R. L., Koonin, E. V. & Lipman, D. J. (1997).
A genomic perspective on protein families. Science,
278, 631-637.

Webb, E. C. (1992). Enzyme Nomenclature 1992. Recommendations
of the Nomenclature Committee of the
International Union of Biochemistry and Molecular
Biology, Academic Press, New York.

Wood, T. C. & Pearson, W. R. (1999). Evolution of protein
sequences and structures. J. Mol. Biol. 291, 977-995.

Zhang, Z., Schaffer, A. A., Miller, W., Madden, T. L.,
Lipman, D. J., Koonin, E. V. & Altschul, S. F. (1998).
Protein sequence similarity searches using patterns
as seeds. Nucl. Acids Res. 26, 3986-3990.



Edited by F. E. Cohen



(Received 2 September 1999; received in revised form 5 January 2000; accepted 6 January 2000)



NCBI Blast: SCS0009 v. PREF-1



Page 1 of 4 



BLAST Basic Local Alignment Search Tool 




Blast 2 sequences 

SCS0009 V. PREF-1 


Query ID          lcl|16501
Description
Molecule type     amino acid
Query Length      352

Subject ID        16502
Description       None
Molecule type     amino acid
Subject Length    385

BLASTP 2.2.18+



Reference

Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402

Stephen F. Altschul, John C. Wootton, E. Michael Gertz, Richa Agarwala, Aleksandr Morgulis, Alejandro A. Schaffer, and Yi-Kuo Yu (2005)
"Protein database searches using compositionally adjusted substitution matrices", FEBS J. 272:5101-5109



Search Parameters

Program                    blastp
Word size                  3
Expect value               10
Threshold                  11
Composition-based stats    2
Genetic Code               1
Window Size                40



Karlin-Altschul statistics

Params    Gapped    Ungapped
Lambda    0.267     0.325778
K         0.041     0.143002
H         0.14      0.515221



Results Statistics

Effective search space    114310

Graphic Summary

Distribution of 3 Blast Hits on the Query Sequence



http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Get&ALIGNMENTS=100&ALIG...



10/09/2008 






[Graphic: overview of the database sequences aligned to the query sequence. Alignments are color-coded by score, within one of five
score ranges. Multiple alignments on the same database sequence are connected by a dashed line.]






Plot of lcl|16501 vs 16502

This dot matrix view shows regions of similarity based upon the BLAST results. The query sequence is represented on the X-axis and the
numbers represent the bases/residues of the query. The subject is represented on the Y-axis and again the numbers represent the
bases/residues of the subject. Alignments are shown in the plot as lines. Plus strand and protein matches are slanted from the bottom left to
the upper right corner, minus strand matches are slanted from the upper left to the lower right. The number of lines shown in the plot is the
same as the number of alignments found by BLAST.




Descriptions 

Legend for links to other resources: UniGene, GEO, Gene, Structure, Map Viewer

Sequences producing significant alignments:
(Click headers to sort columns)

16502  unnamed protein product  172  222  [illegible]  2e-47






Alignments
>lcl|16502 unnamed protein product 






Query 19 APGOPVRADDCSSHCDLAHGCCMDGSCRCDPGWEGLHCERCVRMPGCQHGTCHQPWQCI 78

AG +C CD +G C D CRC GWEG C++CV PGC +G C +PWQCI 

Sbjct 16 AFGHSTYGAECDPPCDPQYGFCEaDKVCRCHVGWEGPLCDKCVTRPGCVNGVCKEPWOCl 75 

Query 79 CHSGWAGKFCDK------------------------------GFHGRDCERKAGPCEQAG 108

C GW GKFC+ GF G+DC+ KAGPC G 

Sbjct 76 CKDGWDGKFCEIDVRACTSTECRNNCTCVDLEKGQYECSCTPGFSGKDCQHKaGPCVIKG 135 

Query 109 SPCRNGGQCQDDQGFALNFTCRCLVGFVGRRCEV— SVDDCLMRPCBNGATCLDGINRFS 166 

SPC++GG C DD+G A + +C C GF G CE+ + C PC N CD F 

Sbjct 136 SPCQHGGACVDDE:GQASHASCl.CPPGFSGBFCElVsaTNSCTPNPCEtlDGVCTDIGGDFR 195 

Query 167 CLCPEGFAGBFCTINLDDCASRPCQRGARCRDBVH-DFDCLCPSGYGGKTCELVL-PVPD 224 

C CP GF + C+ + +CRS PCQ G C F+CLC + G TO P 

Sbjct 196 CRCPflGFVDKTCSRPVSNCftSGPCQMGGTCLQHTQVSFECLCKPPEMGPTCAKKRGASPV 255 

Query 225 PPTTVDTPLGPTSAW VPATGPAPHSAGAGLLRISVKEVVRRQEAGLGEPSLVRL 279 

Sbjct 256 QVTHLPSGYGLTYRLTPGVHELPVQQPEQH ILSVSMKE-UilKSTPLLTEGClB.ICF 309 

Query 280 WFGRLTAALVLATVLLTL 293



Sbjct 



Expect = 2e-06, Method: Compositional matrix adjust.
Positives = 71/199 (35%), Gaps = 51/199 (25%)

CQDDQGF-ALNFTCRCLVGFVGARCEVNVDDCLMRPCBMGATCLDGINR— FSCLCPEGF 11 
C GF + CRC VG+ G C D C-f P C++G+ + + C+C +G+ 

CDPQYGFCEADNVCRCHVGWEGPLC DKCVTAP GCVHGVCKEPWQCICKDGW 8C 

AGRFCTINLDDCASRPCQRCaRCRD-RVHDFDCLCPSGirGGKTCELVLPVPDPPTTVDTP 23 
G+FC I++ C S PC CD ++C C G4- GK C+ P ++ 

Sbjct 81 DGKFCEIDVRaCTSTPCfiWNGTCVDLEKGOYECSCTPGFSGKDCQH — 

Query 233 LGPTSAWVPATGPAPHSAGaGLLRISVKEVVRRQEflGLGEPSI.VALWFGALTAALVLA

Sbjct 137 PCQHGGRCVDDEGQftSHA 

Query 293 TVLLTLRAWRRGVCPPGPC 311 



Sbjct 



Method: Compositional matrix adjust.



5 = 7/14 (50%), 



139 QHGGACVDDEGQAS 152 
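
The fractions reported for each alignment (for example 7/14 and 71/199) are counts of matching columns over alignment length. As a rough sketch of how an identity count can be recomputed from two aligned rows, the snippet below compares the first twelve columns of the alignment row that begins at query residue 79, as transcribed from this scan. The function names are hypothetical, and the sketch counts identities only; scoring "positives" would additionally need the substitution matrix.

```python
def count_identities(query_row, subject_row):
    """Return (identical columns, alignment length) for two aligned rows.

    Rows must have equal length; '-' marks a gap and never counts as a match.
    """
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    identical = sum(
        1 for q, s in zip(query_row, subject_row) if q == s and q != "-"
    )
    return identical, len(query_row)


def percent(numerator, denominator):
    """Plain percentage; the report itself prints whole numbers."""
    return 100.0 * numerator / denominator


if __name__ == "__main__":
    # First twelve aligned columns, as read from the scanned alignment.
    query_fragment = "CHSGWAGKFCDK"
    subject_fragment = "CKDGWDGKFCEI"
    same, length = count_identities(query_fragment, subject_fragment)
    print(f"Identities = {same}/{length} ({percent(same, length):.1f}%)")
```

For this fragment the count is 7 identical columns out of 12.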






MOLECULAR AND CELLULAR BIOLOGY, Feb. 1997, p. 977-988
0270-7306/97/$04.00+0 

Copyright © 1997, American Society for Microbiology 



Vol. 17, No. 2 



Cleavage of Membrane-Associated pref-1 Generates a Soluble 
Inhibitor of Adipocyte Differentiation 

CYNTHIA M. SMAS, LI CHEN, and HEI SOOK SUL*
Department of Nutritional Sciences, University of California, Berkeley, California 94720
Received 30 July 1996/Returned for modification 20 September 1996/Accepted 12 November 1996

pref-1 is an epidermal growth factor-like repeat protein present on the surface of preadipocytes that
functions in the maintenance of the preadipose state. pref-1 expression is completely abolished during 3T3-L1
adipocyte differentiation. Bypassing this downregulation by constitutive expression of full-length
transmembrane pref-1 in preadipocytes drastically inhibits differentiation. For the first time, we show processing of
cell-associated pref-1 to generate both a soluble pref-1 protein of approximately 50 kDa that corresponds to the
ectodomain and also smaller products of 24 to 25 kDa and 31 kDa. Furthermore, while all four of the
alternately spliced forms of pref-1 produce cell-associated protein, only the two largest of the four alternately
spliced isoforms undergo cleavage in the juxtamembrane region to release the soluble 50-kDa ectodomain. We
demonstrate that addition of Escherichia coli-expressed pref-1 ectodomain to 3T3-L1 preadipocytes blocks
differentiation, thus overriding the adipogenic actions of dexamethasone and methylisobutylxanthine. The
inhibitory effects of the pref-1 ectodomain are blocked by preincubation of the protein with pref-1 antibody.
That the ectodomain alone is sufficient for inhibition demonstrates that transmembrane pref-1 can be
processed to generate an inhibitory soluble form, thereby greatly extending its range of action. Furthermore, we
present evidence that alternate splicing is the mechanism that governs the production of transmembrane
versus soluble pref-1, thereby determining the mode of action, juxtacrine or paracrine, of the pref-1 protein.



Adipose tissue is central to the maintenance of energy bal- 
ance, and caloric intake in excess of energy utilization leads to 
obesity. Obesity may arise from increased size of individual 
adipose cells due to lipid accumulation or increased number of 
adipocytes arising from differentiation of adipose precursor 
cells to mature adipocytes. Overfeeding studies in rodents
indicate a lifelong ability to make new fat cells in response to a
high-fat diet and reveal that preadipocytes continue to undergo
differentiation under the appropriate nutritional and hormonal
cues throughout adulthood (10, 25, 26). Recent work reveals
that the ob gene product, leptin, is an adipocyte-produced
hormone involved in the regulation of appetite (17, 36, 51).
Preadipocyte cell lines, such as 3T3-L1, differentiate in vitro in
a process that biochemically and morphologically resembles in 
vivo adipocyte differentiation (14, 15). This spontaneous dif- 
ferentiation is accelerated by treatment of confluent preadipo- 
cytes with dexamethasone and methylisobutylxanthine (39). 
During differentiation, the fibroblastic preadipocyte becomes
spherical and accumulates lipid; this is accompanied by dramatic
alterations in the synthesis of cytoskeletal, extracellular
matrix (ECM) proteins and those required for lipid metabolism,
nutrient transport, and hormone responsiveness (reviewed
in reference 42). Studies of genes expressed during
adipocyte differentiation, such as fatty acid binding protein
(aP2), have demonstrated transactivation by C/EBPα (8,
19) and PPARγ (47), whose ligand has recently been identified
as 15-deoxy-Δ12,14-prostaglandin J2 (13, 24). Use of exogenous
factors with positive or negative effects on adipogenesis has
indicated that differentiation requires the appropriate combinatorial
action of hormones, growth factors, and ECM (reviewed
in reference 42). Preadipocytes, therefore, must integrate
signals from the extracellular environment for
differentiation to ensue. For example, while expression of
C/EBPα and PPARγ2 promotes adipose-specific gene expression
in cells not committed to the adipocyte lineage (20, 50),
these effects generally require the addition of dexamethasone.

* Corresponding author.

We originally identified preadipocyte factor 1 (pref-1) dur- 
ing a differential screening of a 3T3-L1 preadipocyte cDNA 
library designed to isolate genes that regulate adipogenesis 
(41). pref-1 is a transmembrane protein with epidermal growth 
factor (EGF)-like repeats in the extracellular domain, a jux- 
tamembrane region, a single transmembrane domain, and a 
short cytoplasmic tail. The pref-1 transcript undergoes alternate
splicing, with four major forms of the transcript detected
in 3T3-L1 preadipocytes. The longest form, pref-1A, is most
abundant; however, in-frame juxtamembrane deletions result
in three additional transcripts. pref-1 is a unique regulatory
molecule; it is expressed in preadipocytes, in contrast to the
transcription factors, C/EBP and PPARγ, that are only detected
in conditions permissive for differentiation. pref-1 is
readily detected in preadipocytes but is totally absent in mature
fat cells, indicating complete downregulation of pref-1
during adipocyte differentiation. pref-1 mRNA levels are lower
in 3T3-L1 preadipocytes than in the closely related but
differentiation-defective 3T3-C2 cells. The level of pref-1 mRNA is
decreased by treatment with a combination of the adipogenic
inducing agents dexamethasone and methylisobutylxanthine
(44) and by fetal calf serum, a component usually required for
differentiation. Moreover, constitutive expression of pref-1 in
3T3-L1 preadipocytes inhibits their conversion to adipocytes as
determined by cell morphology, level of adipocyte-expressed
mRNAs, and degree of lipid accumulation. These observations
indicate that pref-1 could maintain the preadipose phenotype
and that pref-1 downregulation is integral to adipocyte differentiation.

The most striking feature of pref-1 is the presence of six 
tandem EGF-like repeats in the extracellular domain. The 
EGF-like repeat, first identified in epidermal growth factor, is 
a 35- to 40-amino-acid domain with conserved spacing of six 
cysteine residues. EGF-like domains are present in a number 
of molecules where they mediate protein-protein interaction to 



977 



978 SMAS ET AL. 



Mol. Cell. Biol.



control cell growth and differentiation (1). A single EGF-like 
domain is the functional unit of EGF, transforming growth 
factor α (TGFα), and other growth factors that interact with
the EGF receptor (32). Cleavage of a precursor transmembrane
form at the amino and carboxyl termini of the EGF-like
unit(s) releases the soluble growth factor (5). The extent of
processing varies with the site of synthesis and is not requisite
for biological activity (3, 33, 35, 46, 49). The importance of
EGF-like domains in development is clearly illustrated by the
Drosophila cell-fate determination proteins Notch (48) and
Delta (28), transmembrane proteins that contain 36 and 9
EGF-like repeats, respectively. In contrast to the EGF-like
growth factors, Notch and Delta function as transmembrane
proteins; no processing of the ectodomain and/or release of 
EGF-like repeats occurs. Notch is a multifunctional receptor 
with pleiotropic effects. For example, interaction of the EGF- 
like repeats of Delta with those of Notch on adjacent cells 
transduces a lateral inhibitory signal during development of 
the neurogenic ectoderm (11). These studies suggest that 
EGF-like repeats function either as soluble ligands or as jux- 
tacrine membrane-bound signalling or transducing molecules. 
Notably, the tandem arrangement of the pref-1 EGF-like re- 
peats, the amino acid sequence within individual EGF-like 
repeats, as well as the interruption of its EGF-like repeats by 
introns (40), indicate that overall, pref-1 structure resembles
that of Drosophila Delta. pref-1 has been independently cloned
as delta-like protein dlk (29) on the basis of its expression in
several types of tumors. This observation, together with the
function of pref-1 in adipocyte differentiation, has led us to
hypothesize that in addition to the specific role of pref-1 in
adipocyte differentiation, pref-1 could have a general role in 
maintaining the undifferentiated state. Indeed, pref-1 mRNA 
is detected in several embryonic tissues but not in their adult 
counterparts (41). The downregulation of pref-1 expression in 
differentiation, the inhibitory effects of its forced expression, 
and its EGF-like structural motif all predict that pref-1 may 
function in a manner analogous to that of Notch and Delta by 
interacting with EGF-like repeat proteins on adjacent cells or 
in the ECM to actively maintain the preadipose state. 

In this report, we address whether transmembrane pref-1
undergoes processing to release a soluble factor and if the
pref-1 ectodomain alone can generate the adipogenic inhibitory
signal. By individual transfection of alternately spliced
pref-1 cDNAs we show that membrane-associated pref-1 undergoes
processing to produce several small soluble forms and
a soluble product of 50 kDa that corresponds to the complete
ectodomain. We find that two of the four alternately spliced
pref-1 cDNAs do not generate this largest soluble form. When
the pref-1 ectodomain produced as pref-1/glutathione S-transferase
(GST) fusion protein is added to 3T3-L1 cells, their
differentiation is drastically inhibited. Therefore, pref-1 not
only functions as a transmembrane protein to affect adjacent
cells but can act as a soluble inhibitor of adipocyte differentiation,
with its mode of action, juxtacrine or paracrine, determined
by alternate splicing.

MATERIALS AND METHODS 

Cell culture and transfection. 3T3-L1 cells and COS cells were maintained in
Dulbecco's minimal essential medium (DMEM) with 10% fetal calf serum. For
transfection, either COS-7 or COS-CMT cells were utilized as noted and seeded
at [illegible] cells per 100-mm-diameter dish the day prior to transfection. Two
micrograms of supercoiled DNA was transfected per dish utilizing DEAE-dextran
(Stratagene). The pref-1 expression constructs utilized encompassed the open
reading frame of pref-1 subcloned into either pcDNA1 or pcDNA1AMP
(Invitrogen). For transfection of COS-7 cells, cells and DNA were kept in contact
for 45 min, rinsed with phosphate-buffered saline (PBS), and incubated for 4 h
in DMEM containing 10% fetal calf serum and 100 μM chloroquine; the medium
was then changed to DMEM with 10% fetal calf serum. Unless otherwise
stated, cells and medium were harvested at 72 h posttransfection. Transfection of
COS-CMT cells was performed as described above except that cells were
maintained in DMEM with 10% Serum Plus (JRH Biosciences, Lenexa, Kans.) for
24 h following the onset of transfection and growth medium was supplemented
with 100 μM ZnCl2 beginning at 24 h posttransfection.

Western blot analysis. Cell monolayers were rinsed twice with PBS and
scraped into PBS containing 2 mM phenylmethylsulfonyl fluoride (PMSF). The
cell suspension was subjected to three freeze-thaw cycles, and the crude membrane
fraction was recovered by centrifugation at 13,000 × g for 25 min at 4°C.
The pellet was dissolved in lysis buffer (20 mM Tris-HCl [pH 7.4], 150 mM NaCl,
0.5% sodium deoxycholate, 1% Nonidet P-40 [NP-40], 1 mM EDTA, 2 mM
PMSF) on ice for 30 min and clarified by brief spinning in a microcentrifuge, and

was loaded per lane in a sodium dodecyl sulfate-polyacrylamide gel electrophoresis
(SDS-PAGE) gel and electroblotted onto Immobilon polyvinylidene difluoride
membranes (Millipore) with 30 mM 3-cyclohexylamino-1-propanesulfonic
acid-10% methanol transfer buffer. The pref-1 antibody was raised against a
pref-1/TrpE fusion protein (41). For immunodetection of proteins, membranes
were blocked for 1 h at room temperature in 5% nonfat dry milk-0.5% Tween
20 in PBS. Subsequent incubations and washes were conducted with 1× NET
(145 mM NaCl, 5 mM EDTA, 0.25% gelatin, 0.05% Triton X-100, and 50 mM
Tris-HCl [pH 7.4]). Detection of the antigen-antibody complexes was accomplished
via goat anti-rabbit immunoglobulin G-horseradish peroxidase (HRP)
conjugate (Bio-Rad), and signals were visualized by enhanced chemiluminescence
(ECL) (Amersham) per manufacturer's instructions.

Metabolic labelling and immunoprecipitation. Seventy-two to ninety hours
posttransfection, cell monolayers were rinsed with PBS and incubated in
methionine- and cysteine-free DMEM with 10% dialyzed fetal calf serum for [illegible] min.
Following this, 200 μCi of 35S TransExpress labelling mix (NEN) per ml was
added. After the indicated labelling periods, medium was collected and monolayers
were rinsed with PBS. Cells were either harvested or refed with DMEM
with 10% fetal calf serum for the chase periods. Following the chase period,
medium was collected by sequential centrifugation at 1,100 and 17,000 × g. For
use in immunoprecipitation cell monolayers were harvested in 1× immunoprecipitation
(IP) buffer (20 mM Tris-HCl [pH 7.4], 150 mM NaCl, 0.5% sodium
deoxycholate, 1% NP-40, 1 mM EDTA, and 2 mM PMSF), and medium samples
were adjusted to 1× IP buffer.

For immunoprecipitation, equal amounts of trichloroacetic acid-precipitable
counts of 35S-labelled cell lysates were brought to a volume of 125 μl in lysis
buffer and incubated with 10 μl of antisera for 2 h on ice. Immune complexes
were collected by incubation at 4°C with either fixed Staphylococcus aureus
(Pansorbin; Calbiochem) for 15 min or with protein A-Sepharose for 1 h, and
pellets were washed three times in radioimmunoprecipitation assay buffer (1%
sodium deoxycholate, 1% NP-40, 0.1% SDS, 10 mM HEPES [pH 7.4], and 0.15
M NaCl). Samples were boiled in the presence of 2% (vol/vol) β-mercaptoethanol
and fractionated on SDS-PAGE gels. For 35S-labelled samples, gels were
subjected to fluorography (Entensify; NEN) and exposed to Fuji RX film. For
32P-labelled samples, gels were exposed to Fuji RX film with an intensifying
screen.

In vitro transcription and translation. Full-length pref-1 cDNA in the EcoRI/
XhoI site of the plasmid pcDNA1 was linearized at the unique 3' XhoI site.
Capped, full-length pref-1 sense RNA was synthesized utilizing T3 polymerase
(Stratagene). One microgram of synthesized transcript was used for in vitro
translation with the incorporation of [35S]cysteine (NEN). Products were analyzed
by SDS-10% PAGE and fluorography of dried gels (Entensify).

Posttranslational modification of pref-1 protein. For tunicamycin treatment of
COS cells, cells were incubated at 72 h posttransfection with 10 μg of
tunicamycin/ml or vehicle control during a 6-h metabolic labelling period. For
enzymatic removal of N-linked carbohydrate and neuraminic acid, crude membrane
fraction pellets of 3T3-L1 cells were dissolved in digest buffer (100 mM phosphate
buffer [pH 7.0], 1% NP-40, and 200 μM PMSF). Prior to digestion, 100 μg
of protein was denatured by boiling for 5 min in a final concentration of 0.5%
SDS and 0.1 M β-mercaptoethanol in a total volume of 50 μl. An additional 50
μl of digest buffer was added per sample, samples were adjusted to 1 mM CaCl2
for neuraminidase digestion and 2 U of N-glycanase (peptide N-glycosidase F
[...]
Boehringer Mannheim) were added as indicated for 3.5 h. Samples were analyzed by
SDS-10% PAGE, and pref-1 protein was visualized by Western analysis.

Construction of c-Myc epitope-tagged pref-1 constructs. To tag the C terminus
of the pref-1 protein with the human c-Myc epitope, two oligonucleotides
(coding strand, 5' GATCGAGCAGAAGCTGATCTCCGAGGAGGACCTCTA
ATG 3'; noncoding strand, 5' GATCCATTAGAGGTCCTCCTCGGAGATCA
GCTTCTGCTC 3') were designed to encode the 10-amino-acid human c-Myc
epitope recognized by the monoclonal antibody 9E10 (9) followed by an in-frame
stop codon. The oligonucleotides were annealed by heating for 10 min at 70°C in
25 mM Tris (pH 7.6)-5 mM MgCl2-25 mM NaCl; this was followed by a slow
cooling to room temperature. The Myc tag was ligated into the pref-1/
pcDNA1AMP expression constructs pref-1A and pref-1B at the C terminus of
the pref-1 protein, and the reading frame was confirmed by sequencing.
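
The design of the two c-Myc oligonucleotides can be checked directly from their sequences. The sketch below assumes that the coding frame begins immediately after the 5' GATC overhang (an assumption, since the vector junction is not given here); it translates the coding strand and confirms that the noncoding strand is its reverse complement once the overhangs are set aside. All names in the code are illustrative.

```python
# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Oligonucleotides as printed in the Methods text above.
CODING = "GATCGAGCAGAAGCTGATCTCCGAGGAGGACCTCTAATG"
NONCODING = "GATCCATTAGAGGTCCTCCTCGGAGATCAGCTTCTGCTC"


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(dna))


def translate(dna):
    """Translate codon by codon; return (peptide, stop_codon_found)."""
    peptide = []
    for pos in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[pos:pos + 3]]
        if residue == "*":
            return "".join(peptide), True
        peptide.append(residue)
    return "".join(peptide), False


if __name__ == "__main__":
    insert = CODING[4:]  # assumed frame: skip the 4-nt GATC overhang
    print(translate(insert))
    print(reverse_complement(NONCODING)[:-4] == insert)
```

Under that frame assumption the coding strand reads EQKLISEEDL followed by a stop codon, that is, ten residues and an in-frame stop as the text describes, and the two strands pair with 4-nucleotide GATC overhangs at each end.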

Construction of P-tagged pref-1 and in vitro phosphorylation. A consensus
phosphorylation site for the catalytic subunit of cAMP-dependent protein kinase,



Vol. 17, 1997 



pref-1 CLEAVAGE AND ADIPOCYTE DIFFERENTIATION 979



encoding amino acids RRASV (termed herein P-tag), was inserted into the NcoI
site that occurs at nucleotide 370 in the pref-1 cDNA sequence. This site was
chosen because it occurs in EGF-like repeat two between the third and fourth
cysteines, an area where spacing between cysteine residues is quite variable. Two
oligonucleotides representing the coding (5'CATGGGCGTCGCGCGTCTG
TTG 3') and noncoding (5'CATGCAACAGACGCGCGCGACGC T) strands
with NcoI-compatible ends were annealed as described for the c-Myc
oligonucleotides. The double-stranded product (P-tag) was ligated into the various
c-Myc-tagged pref-1 expression constructs at the NcoI site.

Seventy-two hours posttransfection of COS-CMT cells, medium was collected
by sequential centrifugation at 1,100 and 17,000 × g, and the supernatant was
acetone precipitated. Protein pellets corresponding to 2 ml of medium were
collected by centrifugation, dried, and resuspended in bovine heart kinase
phosphorylation buffer (20 mM Tris-HCl [pH 7.5], 100 mM NaCl, and 12 mM
MgCl2). A total of 125 μl of this was used in the phosphorylation reaction that
included 3 μl of [γ-32P]ATP (3,000 Ci/mmol) and 50 U of heart muscle kinase (Sigma
Chemical) in a final volume of 150 μl. Following incubation at 4°C for 30 min,
[illegible] μl of stop solution (10 mM sodium phosphate [pH 8.0], 10 mM sodium
pyrophosphate, 10 mM EDTA, and 1-mg/ml bovine serum albumin) was added,
along with 110 μl of 10× IP buffer (0.2 M Tris-HCl [pH 7.4], 1.5 M NaCl, 5%
sodium deoxycholate, 10% NP-40, 10 mM EDTA, and 20 mM PMSF), and
samples were divided equally for immunoprecipitation with 10 μl of the indicated

pref-1/GST fusion protein production and 3T3-L1 cell differentiation. EcoRI/
BamHI-digested pGEX2TK (Pharmacia Biotech) and PCR-amplified fragments
of pref-1 were used to generate expression vectors for pref-1/GST fusion proteins.
GST and pref-1/GST, corresponding to the full pref-1 extracellular domain
minus the signal sequence (amino acids 8 through 299) were expressed in BL-21
Escherichia coli and purified by affinity binding to glutathione agarose beads
(Pharmacia Biotech). The proteins eluted by 5 mM reduced glutathione were
dialyzed against 1× PBS, mixed with DMEM containing 0.5% FBS, and filter
sterilized through a Millex [illegible]-μm-pore-size filter (Millipore). At confluence,
3T3-L1 preadipocytes were treated for 48 h with 1 μM dexamethasone and 0.5
mM methylisobutylxanthine (dex/mix). Control GST or pref-1/GST proteins
were added at the start of the differentiation protocol at a concentration of 50
nM. This concentration was maintained by the addition of proteins at subsequent
medium changes. The concentration of purified proteins was determined by
multiplying the purity of the protein determined by Coomassie blue staining of
SDS-PAGE gels by the total protein concentration. Antisera were incubated
with equal volumes of the fusion protein before being added to the medium at a
final dilution of 1:100. The effect of pref-1 inhibition could first be observed 2
days after the start of the differentiation protocol. At 5 days postinitiation of
differentiation, cells were stained for lipid with Oil Red O and photographed,
and RNA was extracted from parallel cultures and subjected to Northern analysis
as previously described (41) utilizing 32P-labelled cDNA for fatty acid synthetase,
C/EBPα, stearoyl coenzyme A desaturase, and fatty acid binding protein and a
labelled PPARγ1 cDNA that detects both the PPARγ1 and PPARγ2 transcripts.
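
The dosing arithmetic described above (purity multiplied by total protein concentration, then dilution to 50 nM) can be sketched as follows. The stock concentration, purity, and 60-kDa molecular mass used in the example are hypothetical placeholders, not values from this report; only the 50 nM target comes from the text.

```python
def specific_protein_mg_per_ml(total_mg_per_ml, purity_fraction):
    """Concentration of the protein of interest = purity x total protein."""
    if not 0.0 <= purity_fraction <= 1.0:
        raise ValueError("purity must be a fraction between 0 and 1")
    return total_mg_per_ml * purity_fraction


def molarity_nM(mg_per_ml, mass_kda):
    """Convert mg/ml (equal to g/l) to nanomolar for a given mass in kDa."""
    return mg_per_ml / (mass_kda * 1000.0) * 1e9


def stock_volume_ul(target_nM, medium_ml, stock_nM):
    """Approximate stock volume (in microliters) to reach target_nM."""
    return target_nM * medium_ml * 1000.0 / stock_nM


if __name__ == "__main__":
    # Hypothetical inputs: 1.0 mg/ml total protein, 80% pure, 60 kDa.
    stock = specific_protein_mg_per_ml(1.0, 0.80)
    stock_nM = molarity_nM(stock, 60.0)
    print(f"stock: {stock_nM:.0f} nM")
    print(f"add {stock_volume_ul(50.0, 10.0, stock_nM):.1f} ul per 10 ml")
```

The volume estimate ignores the small change in total volume caused by adding the stock, which is negligible at these dilutions.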



Cleavage yields a residual C-terminal 25-kDa cell-associated
pref-1. To begin characterizing the pref-1 protein, we
generated antibodies against an E. coli-expressed TrpE/pref-1
fusion protein. In 3T3-L1 cell lysates, a minimum of seven
discrete protein bands of approximately 45 to 55 kDa are
detected by the pref-1 antibody. These are abolished by
preincubation of the pref-1 antisera with TrpE/pref-1 fusion protein
but not with TrpE protein alone (43). Given the complex
pattern of pref-1 protein in preadipocytes, partly due to
expression of at least four alternate transcripts, detailed analysis
of specific pref-1 protein isoforms is inherently difficult. To
overcome these limitations we expressed pref-1 in COS cells, a
cell type that lacks endogenous pref-1 and that has been
extensively utilized to address protein structure and function.
When we transfected full-length pref-1 we observed that, in
addition to full-length transmembrane pref-1, a 25-kDa
cell-associated protein was specifically detected by pref-1 antibody.
Since pref-1 was previously known to exist only in transmembrane
form, the appearance of this 25-kDa membrane-associated
protein was the first indication we had that pref-1, in
addition to its transmembrane location, might exist in a soluble
form; the 25-kDa protein could correspond to residual pref-1
after cleavage and release of some region of the ectodomain
and would thus be predicted to contain the pref-1 cytoplasmic
domain.




FIG. 1. Western analysis of c-Myc-tagged pref-1. Twenty-five micrograms of
protein from COS-7 cells expressing pref-1A (lanes A), pref-1B (lanes B), or
forms tagged with the c-Myc epitope (lanes A^myc and B^myc) were fractionated on
SDS-10% PAGE. (A) Western analysis utilizing a [illegible] dilution of pref-1
primary antibody and a 1:2,000 dilution of goat anti-rabbit secondary antibody
followed by ECL detection. (B) The same blot shown in panel A stripped and
reprobed with a 1:20,000 dilution of the 9E10 primary antibody and a 1:2,000
dilution of goat anti-mouse secondary antibody followed by ECL detection.
Molecular mass markers in kilodaltons are on the left, and the arrows at the right
indicate the positions of the 25-kDa pref-1A and 21-kDa pref-1B protein bands.



To determine if the 25-kDa cell-associated protein corresponds
to the C-terminal cytoplasmic domain, we added a
10-amino-acid human c-Myc epitope tag to the extreme C
terminus of cDNA expression constructs for the two longest
forms of pref-1, pref-1A and pref-1B. Myc-tagged and unmodified
versions of pref-1A and pref-1B were transfected into
COS cells, and crude membrane fraction proteins were analyzed
by Western blotting. The pref-1 antibody detects the
full-length 55-kDa pref-1A and the full-length 51-kDa pref-1B
in the membrane fraction. In addition, a 25-kDa protein results
upon pref-1A expression, and a 21-kDa protein results upon
pref-1B expression (Fig. 1A). This 4-kDa size difference of the
pref-1A and pref-1B proteins reflects the membrane-proximal
153-base deletion in pref-1B arising by alternate splicing. The
Myc-tagged versions of pref-1A and pref-1B are also recognized.
In each case the addition of the Myc tag increases the
molecular mass of the pref-1 bands by 1 kDa, i.e., the size of the
tag. This is most apparent for the 25-kDa pref-1A and the
21-kDa pref-1B proteins. Their size increases upon addition of
the Myc tag indicate that they contain the C terminus of the
pref-1 protein. Reprobing of the same membrane with the
9E10 antibody specific for the Myc epitope (9) shows that only
the Myc-tagged, and not the native forms, of pref-1A and
pref-1B are specifically recognized (Fig. 1B). Furthermore, the
identical 25-kDa pref-1A and 21-kDa pref-1B proteins are
recognized by the pref-1 and the 9E10 antibodies (Fig. 1B).
Given the 1-kDa size increase that occurs with the addition of
the Myc tag, and the recognition of these bands by both antibodies,
we conclude that full-length membrane pref-1 is probably
cleaved to a residual membrane-associated protein of 25
kDa for pref-1A and 21 kDa for pref-1B and that this protein
contains the pref-1 cytoplasmic domain.
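
The size shifts discussed above can be compared with a back-of-the-envelope estimate. The sketch below assumes the common rule of thumb of roughly 110 Da per amino acid residue, which is not a figure taken from this paper, and the function names are illustrative. Apparent masses on SDS-PAGE need not match such estimates exactly, so this is only an order-of-magnitude check.

```python
AVERAGE_RESIDUE_DA = 110.0  # rule-of-thumb average residue mass (assumption)


def residues_from_bases(n_bases):
    """Number of codons in an in-frame stretch of coding sequence."""
    if n_bases % 3 != 0:
        raise ValueError("an in-frame segment must be a multiple of 3 bases")
    return n_bases // 3


def approx_mass_kda(n_residues):
    """Approximate polypeptide mass in kDa from residue count."""
    return n_residues * AVERAGE_RESIDUE_DA / 1000.0


if __name__ == "__main__":
    # 10-residue c-Myc tag, reported in the text as a 1-kDa shift.
    print(f"Myc tag: ~{approx_mass_kda(10):.1f} kDa")
    # 153-base in-frame deletion, reported in the text as a 4-kDa shift.
    deleted = residues_from_bases(153)
    print(f"deletion: {deleted} residues, ~{approx_mass_kda(deleted):.1f} kDa")
```

The estimate for the tag (about 1.1 kDa) is close to the reported 1-kDa shift, while the estimate for the 51-residue deletion (about 5.6 kDa) is of the same order as, though somewhat larger than, the 4-kDa difference seen on the gel.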

The pref-1 ectodomain is cleaved to a soluble factor. Our
Myc tag studies indicate that membrane pref-1 undergoes
cleavage. To detect the pref-1 cleavage product in the medium
and address pref-1 processing in more detail we expressed
pref-1A in COS cells and performed pulse-chase analyses. At
the end of the 30-min pulse period, a pref-1 protein of 55 kDa
is detected by immunoprecipitation of the cell lysate with
pref-1 antibody (Fig. 2A). By 7 h postsynthesis, the majority of
membrane-associated pref-1A has been turned over, and it is






FIG. 2. Analysis of pref-1 processing. (A) Pulse-chase analysis of cellular pref-1. pref-1A-expressing COS-CMT cells were pulse-labelled with [35S]cysteine and
methionine for 30 min (0.5p) and subjected to the indicated chase periods (c) in hours followed by immunoprecipitation of cell lysates with pref-1 antibody and
SDS-PAGE. Cells were harvested at indicated time points. Normal sera controls indicate that the approximately 50-kDa doublet band seen at 24 and 48 h is nonspecific
(43). The exposure time for cell-associated pref-1 is approximately one-fifth that for soluble pref-1 shown in panel C. The arrowhead indicates the position of the 25-kDa
product. (B) Posttranslational modification of pref-1. The left part of the panel shows results of in vitro translation of in vitro-transcribed pref-1 RNA (+) or a no RNA
control (-), in the presence of [35S]cysteine. The right part of the panel shows results from COS cells transfected with the correct (RF+) or reverse (RF-) orientation
of the pref-1 open reading frame and which were subjected to metabolic labelling with [35S]cysteine and methionine in the presence (T+) or absence (T-) of
tunicamycin, immunoprecipitated with pref-1 antibody, and analyzed by SDS-PAGE. The lower part of the panel shows results from denatured crude membrane
fraction protein from 3T3-L1 cells which was either not treated (NT), incubated without addition of enzyme (mock), or treated with N-glycanase, neuraminidase, or
both N-glycanase and neuraminidase for 3.5 h. Following digestion, 50-μg samples were fractionated on SDS-10% PAGE gels and subjected to Western analysis using
pref-1 antisera at a dilution of 1:800 and a 1:2,000 dilution of goat anti-rabbit HRP secondary antibody. (C) Pulse-chase analysis of soluble pref-1. pref-1A-expressing
COS-CMT cells were pulse-labelled with [35S]cysteine and methionine for 30 min and subjected to the indicated chase periods (c) in hours followed by
immunoprecipitation of medium with pref-1 antibody and SDS-PAGE. Samples of medium were collected at indicated time points. The exposure time for the samples was
approximately five times longer than that for the pulse-chase analysis of cell-associated pref-1 shown in panel A. Molecular mass markers in kilodaltons are shown on
the right.



undetectable at 48 h. At the same time a 25-kDa product (Fig.
2A), which is the size of the residual membrane and cytoplasmic
domain of pref-1 detected by Western blotting shown in
Fig. 1, is in the cell lysate. However, the prominence and
relative ratio of the 25-kDa form to that of full-length
membrane-associated pref-1 differs in our Western blot versus
pulse-chase analyses. This apparent discrepancy may be because
whereas Western analysis likely reflects steady-state actual
molar ratios, the signals of the metabolically labelled proteins
are based on their content of methionine and cysteine
and reflect a single time point of synthesis. Since the majority
of the pref-1 extracellular domain consists of six tandem EGF-like
repeats, each containing six cysteine residues, the intensity
of bands detected by pulse-chase studies would be skewed
toward full-length pref-1 in contrast to the residual
membrane-associated 25-kDa form.



In these analyses, the cell-associated 55-kDa pref-1 appears
as a broader signal than that shown in Fig. 1. Although this
could be attributable to the detection technique used,
immunoprecipitation versus Western analysis and gel resolution, the
diffuse nature of the 55-kDa cell-associated pref-1 shown in
Fig. 2A suggests posttranslational modification of the protein.
There are three consensus sites for N-linked glycosylation in
the pref-1 extracellular domain. We determined the size of the
pref-1 primary translation product and utilized tunicamycin, an
inhibitor of N-linked glycosylation, to assess whether pref-1
protein in transfected COS cells contains N-glycan (Fig. 2B). In
vitro translation results in an approximately 39-kDa protein,
which is in agreement with the predicted size of the pref-1
primary translation product (Fig. 2B, left panel). Metabolic
labelling and immunoprecipitation of pref-1-expressing COS
cells reveals that cell-associated pref-1 protein is reduced to a



Vol. 17, 1997 



pref-1 CLEAVAGE AND ADIPOCYTE DIFFERENTIATION 981 



[Figure 3 image: (A) 3.5-h cells; (B) 24-h cells; (C) 24-h media] 

FIG. 3. Detection of soluble form of pref-1 in conditioned medium. Nontransfected (N) or COS-CMT cells transfected with pref-1A (P) were metabolically labelled 
with 35S for 3.5 h, cells and medium were subjected to immunoprecipitation with normal rabbit sera (NRS) or pref-1 antisera (Pref), and products were fractionated 
on SDS-10% PAGE gels. (A) Cells harvested following the 3.5-h labelling period. The asterisk indicates a band that may correspond to a residual 25-kDa pref-1 protein 
associated with the cytoplasmic membrane (Fig. 1 and 2A). (B) Cells harvested 24 h after the onset of the 3.5-h labelling period. (C) Medium harvested 24 h after the 
onset of the 3.5-h labelling period. Molecular mass markers in kilodaltons are on the right. 



more discrete band of 45 kDa in the presence of tunicamycin. 
These bands are not present when an expression construct 
containing the opposite orientation of the pref-1 reading frame 
is employed (Fig. 2B, right panel). This indicates that all of the 
heterogeneous cell-associated proteins we detect correspond to 
various forms of pref-1. To further address this, crude mem- 
brane preparations of 3T3-L1 preadipocytes were treated with 
N-glycanase and neuraminidase, followed by Western analysis. 
No treatment and mock treatment show multiple discrete 
pref-1 protein bands. Digestion with N-glycanase, neuramini- 
dase, or a combination confirms pref-1 is a glycoprotein that 
contains N-linked oligosaccharide and sialic acid (Fig. 2B, 
lower panel). The presence of sialic acid in pref-1 may there- 
fore explain the 6-kDa size difference between the in vitro- 
translated product and pref-1 protein present in tunicamycin- 
treated cells. We conclude that the heterogeneous nature of 
pref-1 protein is due to posttranslational modifications that 
occur within 30 min of synthesis. 

Pulse-chase analysis of the medium (Fig. 2C) demonstrates 
that a soluble 50-kDa form of pref-1 appears 1.5 h postsynthe- 
sis and accumulates thereafter. In addition, a diffuse signal 
between 21 and 31 kDa is present in the medium at 24 h. The 
increase in soluble pref-1 in the medium with a concomitant 
decrease in the membrane-associated form (Fig. 2A) is consis- 
tent with a precursor-product relationship and indicates cell- 
associated pref-1 is processed to release soluble products. This 
does not necessarily indicate that all of membrane-associated 
pref-1 undergoes processing; the decrease in the 55-kDa cell- 
associated form over time is likely due to the combined effects 
of cleavage to soluble forms and recycling and/or turnover of 
membrane pref-1. To further address the nature of these 
smaller proteins in the medium, a longer labelling period was 
used. COS cells were transfected with pref-1A, and cells were 
harvested 3.5 h after labelling or cells and medium were har- 
vested 24 h after the onset of labelling. After the 3.5-h labelling 
period a pref-1 band of approximately 55 kDa, corresponding 
to full-length pref-1A, is detected in pref-1-transfected cells 
(Fig. 3A). It is not present in nontransfected controls nor is it 
detected with normal rabbit sera. An additional band (Fig. 3A) 
may correspond to the residual cytoplasmic membrane-associ- 



ated 25-kDa pref-1 protein noted in Fig. 1 and 2A. As in the 
pulse-chase analysis, cellular pref-1 protein is barely detectable 
24 h postlabelling (Fig. 3B). However, this longer labelling 
identified, in addition to the prominent 50-kDa soluble form, 
24- to 25-kDa and 31-kDa proteins in the medium (Fig. 3C). 
The diffuse nature of the 24- to 25-kDa doublet suggests it 
could arise by differential posttranslational modification of the 
same polypeptide backbone. The low amounts of these smaller 
soluble forms may indicate a slow cleavage event due to a 
limiting proteolysis system. As the immunoprecipitation anal- 
ysis revealed the cell-associated and soluble pref-1 to be close 
in size, 55 kDa versus 50 kDa, and since these analyses were 
performed on separate SDS-PAGE gels, to confirm this size 
difference we directly compared the size of cell-associated and 
soluble pref-1 by resolving them in adjacent lanes of an SDS- 
PAGE gel. Figure 4 shows the result of this size comparison 
analyzed by Western blotting. pref-1 protein is not detected in 
nontransfected COS cells, whereas transfection of pref-1A re- 
sults in an approximately 55-kDa cell-associated pref-1 protein 
and an approximately 50-kDa form in the medium. These 
findings are in agreement with the metabolic labelling results 
shown in Fig. 2 and 3 and taken together indicate that full- 
length cell-associated pref-1 can undergo processing to release 
a 50-kDa soluble form. 

Localization of cleavage and regulation of soluble pref-1 
production by alternate splicing. Identification of multiple sol- 
uble forms of pref-1 and a 25-kDa cell-associated form indi- 
cates that membrane pref-1 is subject to two cleavage events. 
Based on the 50-kDa molecular mass for the large soluble 
form, this cleavage event would occur near the cell membrane. 
The cleavage event that generates smaller soluble pref-1 would 
be predicted to occur at a more membrane-distal site. To study 
the generation of the soluble pref-1 in more detail, we used two 
approaches: (i) addition of a phosphorylation site tag (P-tag) 
to the pref-1 extracellular domain and (ii) determination of the 
effect of various juxtamembrane deletions on the appearance 
of soluble pref-1. We hypothesized that processing from the N 
terminus may generate the smaller soluble forms of pref-1 
detected by metabolic labelling. To determine which portion of 
the extracellular domain of pref-1 is released as the soluble 



982 SMAS ET AL. 



MOL. CELL. BIOL. 



[Figure 4 image: lanes for Cells and Media] 

FIG. 4. Western analysis of cell-associated and soluble pref-1. Cells and 
conditioned medium were harvested from COS-CMT cells transfected with 
pref-1A (Pref-1) or nontransfected (NT) controls. Fifteen micrograms of the cell 
lysate and 5 µl of conditioned medium were fractionated on SDS-10% PAGE 
gels, subjected to Western analysis using a 1:15,000 dilution of pref-1 primary 
antibody and a 1:5,000 dilution of goat anti-rabbit-HRP secondary antibody, and 
products were visualized by ECL. 



[Figure 5 image: NT and Ptag lanes for each antiserum; molecular mass markers at 116, 97, 66, 45, 31, and 21 kDa] 

FIG. 5. Analysis of P-tagged pref-1 in medium. Conditioned medium col- 
lected from nontransfected (NT) COS-CMT cells or COS-CMT cells transfected 
with the P-tagged version of pref-1A (Ptag) was in vitro phosphorylated with 32P 
and immunoprecipitated with either normal rabbit sera (NRS), pref-1 antisera 
(Pref-1), or antisera raised against an unrelated TrpE fusion protein (Unrd). 
Immunoprecipitates were fractionated by SDS-10% PAGE and subjected to au- 
toradiography. Molecular mass markers in kilodaltons are on the right. 



form(s), a consensus phosphorylation site (P-tag) for cAMP- 
dependent protein kinase was added near the N terminus of 
the pref-1 extracellular domain. To minimize the effects of the 
addition of six amino acids on overall structure, the P-tag was 
inserted in the second EGF-like repeat between the third and 
fourth cysteines, an area with variable cysteine spacing. P- 
tagged pref-1A was expressed in COS cells and the medium 
was in vitro phosphorylated and immunoprecipitated (Fig. 5). 
We detected a phosphorylated protein of 50 kDa, the same 
soluble product noted by metabolic labelling; given its size this 
protein likely corresponds to the full ectodomain. The doublet 
of 24 to 25 kDa is also observed by use of the P-tag. These 
proteins therefore contain the second EGF repeat and thus 
probably the N-terminal region of pref-1. They are not de- 
tected in nontransfected COS cells nor when normal sera or an 
unrelated antiserum was used in immunoprecipitation. We 
therefore predict that a pref-1 processing event occurs at a site 
C terminal to the P-tag to generate the N-terminal, P-tagged 
24- to 25-kDa doublet. This membrane-distal event would also 
explain our detection of the 25-kDa residual cell-associated 
pref-1 which is apparent upon expression of pref-1A as shown 
in Fig. 1. The sizes of the soluble forms detected by metabolic 
labelling and P-tag are identical. The differences observed in 
the relative ratio of the 50-kDa to the 24- to 25-kDa soluble 
form may be attributed to inherent differences in the two 
detection methods. While metabolic labelling at cysteine resi- 
dues, abundant in the EGF-like repeat motif, follows a popu- 
lation of pref-1 synthesized during a specific period at 72 h 
posttransfection, by in vitro phosphorylation each pref-1 mol- 
ecule is labelled at a single P-tag site. The signals of the various 
soluble forms determined by in vitro phosphorylation likely 
reflect a steady-state level of their molar ratios accumulated 
from 24 to 72 h posttransfection. 

The four alternately spliced forms of pref-1 that we previ- 
ously identified in 3T3-L1 preadipocytes, and which have var- 
ious in-frame extracellular juxtamembrane deletions, provided 



a system in which to test our hypothesis that the 50-kDa sol- 
uble pref-1 derives from an extracellular membrane-proximal 
cleavage. The structures of these alternate forms are depicted 
in Fig. 6B. Transfection of each of the four major alternate 
forms of the pref-1 cDNA results in membrane-associated 
pref-1 proteins whose molecular masses decrease in correspon- 
dence to their respective deletions (43). To address soluble 
pref-1 production, the four P-tagged alternately spliced forms 
were expressed in COS cells. The medium was subject to in 
vitro phosphorylation at the P-tag site, immunoprecipitation, 
and SDS-PAGE analysis. Strikingly, whereas each isoform ex- 
presses the 24- to 25-kDa doublet in the medium, the large 
soluble form is produced only by pref-1A and pref-1B; little if 
any large soluble pref-1 is generated by the two alternately 
spliced isoforms with larger juxtamembrane deletions, pref-1C 
and pref-1D (Fig. 6A). These data reveal that the cleavage that 
generates large soluble pref-1 occurs within a sequence com- 
mon to pref-1A and pref-1B. Furthermore, the observation 
that pref-1B results in the large soluble form and pref-1C does 
not indicates that the sequence present in pref-1B, but deleted 
in pref-1C, contains the membrane-proximal processing site 
for the generation of the large soluble pref-1. This localizes the 
cleavage event to within the 22-amino-acid juxtamembrane 
sequence PEQHILKVSMKELNKSTPLLTE (Fig. 6C). Inter- 
estingly, our localization of the membrane-proximal cleavage 
to within the sequence PEQHILKVSMKELNKSTPLLTE 
agrees with the protein sequence of fetal antigen 1 (FA1), 
reported during the course of our experiments. FA1 is a cir- 
culating fetal protein with undetermined function that likely 
corresponds to the complete extracellular domain of human 
pref-1 (23). The N terminus of FA1 begins after the pref-1 
signal sequence, and although the extreme C terminus of FA1 
has not been unambiguously assigned, it falls within the 22- 
amino-acid sequence that we determined contains the mem- 
brane-proximal cleavage site for the release of the 50-kDa 
soluble pref-1. Although no consensus processing sites are 
present in this 22-amino-acid sequence, the sequence is nota- 






[Figure 6 image: (A) autoradiograph, pref-1 antibody; (B) schematic of pref-1A, pref-1B, pref-1C, and pref-1D; (C) PEQHILKVSMKELNKSTPLLTE] 

FIG. 6. Effect of alternate splicing on appearance of soluble pref-1. (A) The 
various alternately spliced (A, B, C, D) and P-tagged (PTAG) forms of pref-1 
were expressed in COS-CMT cells. The presence (+) or absence (-) of the P-tag 
is indicated. Conditioned medium was subjected to in vitro phosphorylation with 
32P, and following immunoprecipitation with pref-1 antibody, products were 
analyzed by SDS-12.5% PAGE and autoradiography. Molecular mass markers 
in kilodaltons are on the right. (B) The predicted structures of the four alter- 
nately spliced forms of the pref-1 cDNA are shown. S, signal sequence; EGF, 
EGF-like repeat; J, juxtamembrane; T, transmembrane; C, cytoplasmic domain; 
P, location of the P-tag in the second EGF-like repeat. The thin connecting line 
represents the area deleted in each of the forms of the protein and the number 
shown indicates the calculated molecular weight of the primary amino acid 
sequence deleted. (C) The 22-amino-acid juxtamembrane sequence, present in 
pref-1B but absent in pref-1C, predicted to be involved in release of the 50-kDa 
soluble pref-1 is shown. The glutamic acid, lysine, and leucine residues are shown 
in bold. The leucines spaced every seventh amino acid, as in leucine zipper 
motifs, are underlined. 



ble for the distinct spacing of lysine, glutamic acid, and leucine 
(Fig. 6C). The glutamic acids occur every tenth residue and the 
lysines every fourth residue. Most interestingly, the leucines 
are spaced every seventh residue, reminiscent of the leucine 
zipper motif for protein-protein interaction. However, the 
presence of proline, which disrupts alpha-helical structures, 
argues against a typical leucine zipper motif. Together the 
above-described results indicate that the 55-kDa membrane- 
associated pref-1 can undergo two cleavage events. These are 
depicted in Fig. 7. A membrane-distal event generates the 
P-tagged approximately 24- to 25-kDa soluble pref-1 and the 
residual 25-kDa membrane-associated protein containing the 
pref-1 cytoplasmic domain. A membrane-proximal event 
within the sequence PEQHILKVSMKELNKSTPLLTE gen- 
erates the 50-kDa soluble form of pref-1. Cleavage at this 



[Figure 7 image: full-length pref-1A (EGF, J, T, C; Ptag; M), ~55 kDa; distal (D) processing products, ~25 kDa and ~25 kDa; proximal (P) processing products, ~50 kDa and ~8 kDa] 

FIG. 7. Proposed model for processing of membrane-associated pref-1. The 
structure of full-length pref-1A is shown at top. EGF, EGF-like repeats; J, 
juxtamembrane; T, transmembrane; C, cytoplasmic; Ptag, location of the P-tag in 
the second EGF-like repeat; M, location of the C-terminal Myc-epitope tag. The 
predicted processing events are shown by arrowheads and are designated D for 
the membrane-distal event and P for the membrane-proximal event. The corre- 
sponding cleavage products are outlined below, and approximate molecular 
masses in kilodaltons are on the right. The model incorporates data from West- 
ern blot, pulse chase, and P-tag studies with the proposed cleavage sites assigned 
based on the sizes of membrane-bound and soluble pref-1. The membrane- [...] 
domain based on the differential effects of alternate splicing on the generation of 
the 50-kDa soluble pref-1. This cleavage event would also predict the generation 
of a residual membrane-associated protein of approximately 8 kDa. The mem- [...] 
repeat. Cleavage at this location predicts the generation of proteins correspond- 
ing to the small soluble form of approximately 25 kDa and the residual cell- 
associated 25-kDa pref-1. Furthermore, this is the location of the alanine- and 
valine-rich sequence (see Discussion) that is similar to sites involved in the 
processing of several other transmembrane proteins. The percentage of mem- 
brane-expressed pref-1 that is subject to each processing event and whether the 
two cleavages occur independently or sequentially remain to be established. The 
slight differences in the observed and predicted sizes of the products of pref-1 
cleavage may arise from as yet unidentified cleavage events. The origin of the 
minor 31-kDa product observed via metabolic labelling has not been determined 
and for this reason is not included in this model. 



membrane-proximal site would be predicted to result in a 
residual cell-associated pref-1 with a calculated molecular 
mass of 8 kDa that was perhaps too small to be detected in our 
experiments. Furthermore, the differential effects of alternate 
splicing on the production of soluble pref-1 demonstrate that 
this is a mechanism for determining the type(s) of soluble 
and/or transmembrane pref-1 produced. Our findings there- 
fore indicate that pref-1 has the potential to function not only 
in a juxtacrine fashion as a transmembrane protein but as a 
soluble protein with paracrine actions. 
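
The residue spacing described above for the 22-amino-acid juxtamembrane sequence (glutamic acid every tenth residue, lysine every fourth, leucine every seventh) can be checked directly from the sequence as printed in the text. The following is a minimal illustrative sketch, not part of the original analysis; the helper names `positions` and `has_run` are hypothetical and introduced here only for the example.

```python
# Illustrative check of residue spacing in the juxtamembrane sequence.
# The sequence is taken from the text; helper names are assumptions.
SEQ = "PEQHILKVSMKELNKSTPLLTE"


def positions(seq, residue):
    """Return the 1-based positions of `residue` within `seq`."""
    return [i + 1 for i, aa in enumerate(seq) if aa == residue]


def has_run(pos, step, length=3):
    """True if `pos` contains `length` positions spaced exactly `step` apart."""
    present = set(pos)
    return any(all(p + k * step in present for k in range(length)) for p in pos)


if __name__ == "__main__":
    # (residue, spacing) pairs as stated in the text
    for residue, step in (("E", 10), ("K", 4), ("L", 7)):
        pos = positions(SEQ, residue)
        print(residue, pos, has_run(pos, step))
    # Prolines, which disrupt alpha-helical structure, are also listed
    print("P", positions(SEQ, "P"))
```

Listing the proline positions alongside the leucine run makes it easy to see where a helix-disrupting residue falls relative to the heptad-spaced leucines.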

Soluble pref-1 acts to inhibit adipocyte differentiation. Al- 
though our studies have not defined the exact area of cleavage 
in vivo, results of pulse-chase analyses and the transfection of 
alternately spliced isoforms of pref-1 indicate that the full 
pref-1 ectodomain is present in culture medium as the result of 
a membrane-proximal cleavage event. We have previously 
shown that constitutive expression of full-length pref-1 drasti- 
cally inhibits 3T3-L1 adipocyte differentiation. Whereas all 
four alternate forms express the 24- to 25-kDa soluble product, 
only the largest soluble form is differentially generated; it is 
derived from the pref-1A and pref-1B isoforms but not the 
pref-1C and pref-1D isoforms. While we have employed COS 
cells to address processing of specific pref-1 isoforms, the ex- 
istence of the pref-1 ectodomain (FA1) in fetal circulation is 
definitive evidence that the pref-1 processing we detect occurs 
in vivo and strongly indicates an in vivo function for soluble 
pref-1. Since the only model system for pref-1 action described 
to date is the inhibition of adipocyte differentiation, we there- 
fore addressed the bioactivity of soluble pref-1 in adipocyte 
differentiation. The entire pref-1 extracellular domain, corre- 
sponding to the 50-kDa soluble pref-1, was produced as a GST 







[Figure 8B image: percentages of lipid-containing cells below the photomicrographs: 80%, 80%, 10%, 65%, 10%] 

FIG. 8. Production and activity of pref-1/GST fusion protein. (A) Expression, purification, and Western analysis of pref-1/GST fusion protein. Left, Coomassie 
blue-stained SDS-PAGE gel of total protein from uninduced (-) and IPTG-induced (+) BL21 E. coli harboring either the GST or pref-1/GST expression constructs; 
middle, affinity-purified 29-kDa GST and the 63-kDa pref-1/GST fusion protein on a Coomassie blue-stained SDS-PAGE gel; right, Western analysis of purified GST 
and pref-1/GST using pref-1 antibody. Molecular mass markers in kilodaltons are on the left. (B) The inhibitory effect of pref-1/GST on 3T3-L1 differentiation assessed 
by Oil Red O staining of cellular lipid. Additions to the standard dex/mix differentiation treatment are noted above photomicrographs. GST, GST protein; Pref-1 Ab, 
antibody directed against a pref-1/TrpE fusion protein; Pref-1, pref-1/GST fusion protein; Neg. Ab, antibody directed against an unrelated TrpE fusion protein. The 
percentages of lipid-containing cells are indicated below the photomicrographs. 



fusion protein in E. coli. Cells harboring the GST or pref-1/ 
GST expression construct show an identical pattern of proteins 
upon Coomassie blue staining of SDS-PAGE gels. Induction of 
protein expression with isopropyl-β-D-thiogalactopyranoside 
(IPTG) results in proteins of the size predicted for GST alone 
(29 kDa) or pref-1/GST (63 kDa); these are the most abundant 
proteins detected (Fig. 8A, left panel). Coomassie blue stain- 
ing of soluble fusion proteins after affinity binding to glutathi- 
one agarose beads shows a single band, indicating purification 
to near homogeneity (Fig. 8A, middle panel). Western analysis 
reveals that the pref-1/GST fusion protein but not GST alone 
is specifically detected by pref-1 antibody (Fig. 8A, right pan- 
el). To test the effect of soluble pref-1 on adipocyte differen- 
tiation, confluent 3T3-L1 preadipocytes were treated with dex/ 
mix to initiate differentiation. The medium was supplemented 
with either the GST protein or the pref-1/GST fusion protein. 
Additionally, to address the specificity of the effects of pref-1/ 
GST, pref-1 antibody or an antibody against an unrelated TrpE 
fusion protein was utilized. After 5 days, cells were fixed and 
stained with Oil Red O, and the degree of adipocyte differen- 
tiation was judged by cell morphology and the percentage of 
lipid-containing cells (Fig. 8B). Addition of either the GST 
protein or pref-1 antibody had no discernible effects; 80% of 
cells differentiated to adipocytes as indicated by high lipid 
content and rounded appearance. Addition of pref-1/GST fu- 
sion protein markedly inhibited differentiation, and these cells 






TABLE 1. Concentration-dependent inhibition of 3T3-L1 
differentiation by GST-pref-1 fusion protein^a 



100 



<20 



<20 



^a Confluent 3T3-L1 cells in quadruplicate dishes were subject to dex/mix treat- 
ment in the presence of the indicated concentrations of fusion protein. Six days 
after induction cells were stained for lipids with Oil Red O and examined 
microscopically for percentage of adipocyte conversion. The average of four 
dishes is indicated. 



had very little lipid accumulation and maintained fibroblast 
morphology with only 10% of the cells differentiating. Further- 
more, the inhibitory effects of the fusion protein on adipocyte 
differentiation are attenuated by preincubation of pref-1/GST 
with antiserum against pref-1. These cultures show 65% dif- 
ferentiation, whereas preincubation with an unrelated control 
serum does not affect the inhibitory action of pref-1/GST as 
evidenced by only 10% differentiation. This indicates that the 
inhibitory effects of the pref-1/GST fusion protein are specif- 
ically due to the pref-1 ectodomain. To address whether there 
is a dose-response effect of pref-1 action, we tested the inhib- 
itory action of pref-1 at protein concentrations of 0, 5, 10, 25, 
50, and 100 nM. As is shown in Table 1, the inhibitory effects 
of pref-1 are first noted at 10 nM, and maximum inhibition is 
observed at 50 nM. These data, as well as the blocking effects 
of pref-1 antibody shown in Fig. 8B, are consistent with the 
existence of a specific pref-1 receptor. However, the biological 
nature of the assay system, namely the inhibition of adipocyte 
differentiation, limits more detailed determination of the ki- 
netics of pref-1 interaction with its predicted receptor. Such 
analyses await the identification and isolation of the pref-1 
receptor by interaction cloning or other methods. 

We next addressed the inhibition of adipocyte differentia- 
tion in detail at the molecular level. The effects of soluble 
pref-1 on the level of adipocyte-expressed RNAs were deter- 
mined and correlated with morphological evidence of adipo- 
cyte differentiation. Cells were treated with dex/mix alone or 
supplemented with GST or pref-1/GST and stained for lipid 
with Oil Red O 5 days after initiation of differentiation (Fig. 
9A); we observed the same inhibitory effects for pref-1 which 
are shown in Fig. 8B. Northern analysis for five adipocyte- 
expressed mRNAs reveals that compared to cells differentiated 
with the standard dex/mix treatment or with the addition of 
GST protein, pref-1-treated cells have only 20% of the levels of 
the terminal marker mRNAs for fatty acid synthase, stearoyl 
coenzyme A desaturase, and fatty acid binding protein (Fig. 
9B). Moreover, the levels of mRNA for C/EBPα and PPARγ 
are similarly decreased, indicating the inability of cells to ex- 
press these transcription factors in the presence of soluble 
pref-1. This suggests that the inhibitory effects of soluble pref-1 
are exerted early in differentiation. The results indicate that 



[Figure 9 image: (A) photomicrographs for DM, DM+Pref-1, and DM+GST; (B) Northern blot] 

FIG. 9. Soluble pref-1 inhibits adipocyte differentiation. (A) Confluent 3T3-L1 preadipocytes were subject to standard in vitro differentiation conditions (DM) or 
supplemented with either the pref-1/GST fusion protein (DM+Pref-1) or GST control (DM+GST) throughout the course of differentiation. At 5 days after initiation 
of differentiation, cultures were stained with Oil Red O and photographed. Typical microscopic fields are shown. (B) Ten micrograms of total RNA from parallel 
cultures were subject to Northern blot analysis using the indicated 32P-labelled cDNA probes. The PPARγ signal appears as a doublet since both the γ1 and γ2 isoforms 
are detected. Representative ethidium bromide staining of the Northern gel is shown at the bottom. 







the pref-1 ectodomain alone, corresponding to the soluble 
form we detect in conditioned medium, is sufficient for the 
inhibitory effect of pref-1 in adipocyte differentiation. This 
further suggests that the pref-1 molecule, in either transmem- 
brane or soluble form, probably functions as a ligand to initiate 
and/or maintain signals inhibitory to adipogenesis. 

DISCUSSION 

pref-1 exists in both transmembrane and soluble forms. We 
demonstrate by pulse-chase analyses and in vitro phosphory- 
lation at the P-tag site that full-length pref-1 undergoes cleav- 
age at a membrane-proximal site to release an N-terminal 
soluble product of 50 kDa. This soluble form inhibits adipocyte 
differentiation. The differential effects of alternate splicing on 
the production of the 50-kDa soluble pref-1 predict that cleavage 
occurs extracellularly near the transmembrane domain at a 
membrane-proximal site within the 22-amino-acid sequence 
PEQHILKVSMKELNKSTPLLTE. This agrees with the pro- 
tein sequence of the human fetal protein FA1, reported during 
the course of our studies, that corresponds to the pref-1 extra- 
cellular domain. The simplest interpretation of our data is that 
the spliced-out sequence removes a processing site. This has 
strong similarities to the effect of alternate splicing in the c-kit 
ligand where the KL-1 form is processed while the alternately 
spliced KL-2 form is not efficiently cleaved due to a juxtamem- 
brane deletion encompassing the preferred processing site (12, 
21). This 22-amino-acid sequence does not contain any recog- 
nizable motifs such as the basic residues that are processing 
sites for kex2/furin proteases (18) or the small apolar amino 
acids where cleavage of TGFα occurs (31). It is of interest to 
note that the splicing event removes portions of the juxtamem- 
brane region, including sequences reminiscent of a leucine 
zipper. 

These findings place pref-1 into that class of proteins which 
can act either as transmembrane or soluble molecules. Among 
EGF-like repeat proteins, ectodomain processing and release 
has been demonstrated only for those growth factors that func- 
tion through the EGF receptor and related receptors. Trans- 
fection studies with EGF and TGFα have allowed detailed 
analysis of their processing from transmembrane precursors; 
however, processing is not requisite for their biological activity. 
Membrane-anchored forms of EGF and TGFα bind and acti- 
vate the EGF receptor (2, 32). The full 160-kDa pro-EGF 
produced by the kidney is active (37). Ectodomain release also 
occurs for transmembrane molecules other than the EGF-like 
repeat growth factors, including the c-kit ligand (7, 21), tumor 
necrosis factor 1 receptor (16, 27), and the β-amyloid precur- 
sor protein (4). While our data indicate that the 50-kDa solu- 
ble form of pref-1 corresponds to the full ectodomain as the 
result of membrane-proximal cleavage, formulation of a com- 
plete model for the generation of the minor soluble forms of 
pref-1 is not yet possible. In addition to a prominent soluble 
form of 50 kDa, we also find that the 24- to 25-kDa form 
contains the P-tag placed near the pref-1 N terminus. We can 
speculate that subsequent cleavage of the 50-kDa form of 
soluble pref-1 at the membrane-distal site to generate the 24- 
to 25-kDa soluble pref-1 could serve as a mechanism to inac- 
tivate the larger soluble form or otherwise modulate its activ- 
ity. However, as we have not yet clearly delineated which 
portion of the pref-1 ectodomain the smaller soluble proteins 
derive from, we cannot at this time address with any certainty 
their function. Nevertheless, a membrane-distal processing 
event would be predicted to occur at a site C terminal to the 
P-tag inserted in the second EGF-like repeat. This would re- 
lease the 24- to 25-kDa soluble form. We predict that this 



membrane-distal event also generates the 25-kDa residual cell- 
associated pref-1 that we determined by Myc-epitope tagging 
to contain the pref-1 cytoplasmic domain. With the assumption 
that the size of this 25-kDa residual cell-associated pref-1 is 
attributable solely to primary amino acid sequence, the 25-kDa 
residual cell-associated protein would correspond to the ex- 
treme C terminus of pref-1 up to EGF-like repeat five. Inspec- 
tion of the primary amino acid sequence of pref-1 within the 
region bordered by the P-tag and the transmembrane domain 
reveals an area of small apolar amino acids, Val-Ala-Ala, be- 
tween the fourth and fifth EGF-like repeats. This is similar to 
the cleavage site(s) used for the release of mature soluble 
EGF, TGFα, and KL-1 from transmembrane precursors (30, 
34). Preliminary studies indicate site-directed mutagenesis of 
the pref-1 Val-Ala-Ala sequence alters the amount and ap- 
pearance of soluble pref-1 (43). 

Functional implications of pref-1 processing. The work pre- 
sented demonstrates that the pref-1 ectodomain/GST fusion 
protein, which corresponds to the 50-kDa soluble form, inhib- 
its adipocyte differentiation, as we have previously shown for 
the membrane-associated form. Since the 50-kDa soluble form 
of pref-1 has inhibitory activity similar to that of the full-length 
membrane-associated form, release of the pref-1 ectodomain 
as a soluble factor allows switching between two active forms of 
pref-1, thereby regulating its range of action. Therefore, pref-1 
not only functions in a juxtacrine manner as a transmembrane 
protein to affect adjacent cells but can have paracrine actions 
as a soluble inhibitor of adipocyte differentiation. We have 
confirmed the inhibitory effects of soluble pref-1 by treating 
confluent 3T3-L1 preadipocytes with dex/mix in the presence 
of conditioned medium from transfected COS cells. Following 
a 2-day dex/mix treatment, cells were maintained in 50% fresh 
growth medium-50% conditioned medium. While cells treated 
with conditioned medium from mock-transfected COS cells 
differentiated well, as judged by the number of lipid-containing 
cells, conditioned medium from pref-1A-transfected COS cells 
drastically reduced adipocyte differentiation (43). Thus, use of 
two different approaches, GST fusion protein and conditioned 
medium, demonstrates the inhibitory action of soluble pref-1 
and indicates that the bioactivity of the GST fusion protein is 
similar to that produced by COS cells. Although both the 
transmembrane and soluble pref-1 are active in the inhibition 
of adipocyte differentiation, future studies may reveal finer 
distinctions in their respective functions, as demonstrated for 
the kit ligand where the soluble factor does not fully substitute 
for the actions of membrane-bound kit ligand in vivo (12). 
These inhibitory effects observed with pref-1A-conditioned 
medium are additional evidence for an in vivo role of soluble 
pref-1 in the regulation of adipocyte differentiation. Moreover, 
we have observed that treatment of 3T3-L1 preadipocytes with 
conditioned medium from COS cells transfected with pref-1A 
markedly inhibits adipocyte differentiation, while conditioned 
medium from COS cells transfected with the most deleted 
alternate form, pref-1D, does not affect adipocyte differentiation 
(43). We therefore hypothesize that the mode of function, 
juxtacrine or paracrine, depends on the alternate pref-1 tran- 
script expressed. 

The temporal expression of genes during adipocyte differ- 
entiation suggests a hierarchy of regulatory events. Based on 
expression pattern and transfection studies, C/EBP and 
PPARγ have been shown to be central to adipogenesis. However, 
factors such as cell confluence/growth arrest, fetal calf 
serum, dexamethasone, and an ECM environment conducive 
to adipocyte differentiation may govern expression and action 
of these transcription factors. The absolute downregulation of 
pref-1 during adipocyte conversion and the inhibitory effects of 



Vol. 17, 1997 



pref-1 CLEAVAGE AND ADIPOCYTE DIFFERENTIATION 987 



forced pref-1 expression in preadipocytes suggest it has a 
unique regulatory function in this process. In conditions under 
which preadipocytes normally differentiate, addition of soluble 
pref-1 prevents expression of both PPARγ and C/EBPα, the 
regulatory molecules that transactivate adipocyte genes and 
lead to adipogenesis. This is consistent with the concept that 
downregulation of pref-1 is a prerequisite for C/EBPα and 
PPARγ induction and adipocyte differentiation. Our experiments 
here suggest that, via the generation of a soluble inhibitory 
form, pref-1 is likely to have a wider range of function 
than was first predicted on the basis of its synthesis as a trans- 
membrane protein. The inhibitory effects of fibronectin (45) 
and collagen (22) on adipocyte differentiation indicate cy- 
toskeletal and/or ECM remodelling is requisite for adipocyte 
differentiation. By analogy, pref-1, as either a transmembrane 
or soluble protein, may exert its inhibitory effects through 
interaction of its EGF-like repeats with EGF-like repeats 
present in cell surface or ECM components to maintain the 
preadipose phenotype. It is intriguing, given its structural 
similarities to the Notch-Delta family, that pref-1 is processed to 
generate soluble forms. Work presented here does not rule out 
the possibility that transmembrane pref-1 may act as a receptor 
to transduce inhibitory signals. However, the fact that the 
pref-1 ectodomain alone inhibits adipocyte differentiation 
indicates that generation of the inhibitory signal does not require 
the pref-1 cytoplasmic region. This suggests that soluble pref-1 
acts as a signalling molecule through an as yet unidentified 
receptor. It is highly unlikely that pref-1 acts through the EGF 
receptor. Not only are the spacing and conservation of amino 
acids required for EGF-receptor interaction (38) absent in 
pref-1, but we have failed to detect for pref-1 the mitogenic 
effect normally associated with EGF receptor function (6). We 
hypothesize the existence of an EGF repeat-containing receptor 
for pref-1 that could be analogous in action to the 
Notch-Delta receptor-ligand pair. 

Although we address here the role of soluble pref-1 in adipocyte 
differentiation, other findings point to a broader role 
for pref-1 in differentiation and development. pref-1 is detected 
in various tissues early in embryogenesis but not in 
corresponding adult tissues (41). Expression of the pref-1 homolog 
dlk has been linked to small cell lung carcinoma and 
neuroendocrine tumors (29). In the larger context we hypoth- 
esize that pref-1 may maintain undifferentiated states in a 
number of cell types, and its downregulation may be required 
for differentiation. The expression of alternately spliced forms 
of pref-1, each with potentially distinct functions and ranges of 
action, may be temporally and/or spatially restricted during 
development. The finding that FA1, the pref-1 extracellular 
domain, is present in fetal circulation supports a broader in 
vivo role for the processing and effects of soluble pref-1 than 
we have described. Along these lines it is tempting to speculate 
that soluble pref-1 may repress adipogenesis in vivo, a process 
that, depending on the species, occurs late in gestation or 
neonatally. 

ACKNOWLEDGMENTS 

We thank R. Evans for the PPARγ cDNA, S. McKnight for the 
C/EBPα cDNA, and S. Patel for technical assistance. 

This work was supported by grant DK49620 from the National 
Institutes of Health to H.S.S. 

REFERENCES 



1. Appella, E., I. T. Weber, 
epidermal growth factor-like regions in proteins. FEBS Lett. 231:1-4. 

2. Brachmann, R., P. B. Lindquist, M. Nagashima, W. Kohr, T. Lipari, M. 
Napier, and R. Derynck. 1989. Transmembrane TGF-alpha precursors activate 
EGF/TGF-alpha receptors. Cell 56:691-700. 



biological properties. J. Biol. Chem. 265:16564-16570. 

4. Buxbaum, J. D., S. E. Gandy, P. Cicchetti, M. E. Ehrlich, A. J. Czernik, R. P. 
Fracasso, T. V. Ramabhadran, A. J. Unterbeck, and P. Greengard. 1990. 
Processing of Alzheimer beta/A4 amyloid precursor protein: modulation by 
agents that regulate protein phosphorylation. Proc. Natl. Acad. Sci. USA 
87:6003-6006. 

5. Carpenter, G., and S. Cohen. 1990. Epidermal growth factor. J. Biol. Chem. 
265:7709-7712. 

6. Chen, L., and H. S. Sul. Unpublished data. 

7. Cheng, H. J., and J. G. Flanagan. 1994. Transmembrane kit ligand cleavage 
does not require a signal in the cytoplasmic domain and occurs at a site 
dependent on spacing from the membrane. Mol. Biol. Cell 5:943-953. 

8. Christy, R. J., V. W. Yang, J. M. Ntambi, D. E. Geiman, W. H. Landschulz, 
A. D. Friedman, Y. Nakabeppu, T. J. Kelly, and M. D. Lane. 1989. 
Differentiation-induced gene expression in 3T3-L1 preadipocytes: 
CCAAT/enhancer binding protein interacts with and activates the promoters of two 
adipocyte-specific genes. Genes Dev. 3:1323-1335. 

9. Evan, G. I., G. K. Lewis, G. Ramsey, and J. M. Bishop. 1985. Isolation of 
monoclonal antibodies specific for human c-myc proto-oncogene product. 
Mol. Cell. Biol. 5:3610-3616. 

10. Faust, I. M., P. R. Johnson, J. S. Stern, and J. Hirsch. 1978. Diet-induced 



[...] obesity. [...] J. Physiol. 235:[...] 



11. Fehon, R. G., P. J. Kooh, I. Rebay, C. L. Regan, T. Xu, M. A. T. Muskavitch, 
and S. Artavanis-Tsakonas. 1990. Molecular interactions between the 
protein products of the neurogenic loci Notch and Delta, two EGF-homologous 
genes in Drosophila. Cell 61:523-534. 
12. Flanagan, J. G., D. C. Chan, and P. Leder. 1991. Transmembrane form of 

missing in the Sld mutant. Cell 64:1025-1035. 
13. Forman, B. M., P. Tontonoz, J. Chen, R. P. Brun, B. M. Spiegelman, and 
R. M. Evans. 1995. 15-Deoxy-delta 12,14-prostaglandin J2 is a ligand for the 
adipocyte determination factor PPAR gamma. Cell 83:803-812. 
14. Green, H., and O. Kehinde. 1976. Spontaneous heritable changes leading to 
increased adipose conversion in 3T3 cells. Cell 7:105-113. 
15. Green, H., and O. Kehinde. 1979. Formation of normally differentiated 
subcutaneous fat pads by an established preadipose cell line. J. Cell. Physiol. 
101:169-171. 

16. Gullberg, U., M. Lantz, L. Lindvall, I. Olsson, and A. Himmler. 1992. 
Involvement of an Asn/Val cleavage site in the production of a soluble form 
of a human tumor necrosis factor (TNF) receptor. Site-directed mutagenesis 
of a putative cleavage site in the p55 TNF receptor chain. Eur. J. Cell Biol. 
58:307-312. 

17. Halaas, J. L., K. S. Gajiwala, M. Maffei, S. L. Cohen, B. T. Chait, D. 
Rabinowitz, R. Lallone, S. K. Burley, and J. M. Friedman. 1995. 
Weight-reducing effects of the plasma protein encoded by the obese gene. Science 
269:543-546. 

18. Hatsuzawa, K., K. Murakami, and K. Nakayama. 1992. Molecular and 
enzymatic properties of furin, a Kex2-like endoprotease involved in 
precursor cleavage at Arg-X-Lys/Arg-Arg sites. J. Biochem. 111:296-301. 
19. Herrera, R., H. S. Ro, G. S. Robinson, K. G. Xanthopoulos, and B. M. 
Spiegelman. 1989. A direct role for C/EBP and the AP-1-binding site in gene 
expression. Mol. Cell. Biol. 9:5331-5339. 

20. Hu, E., P. Tontonoz, and B. M. Spiegelman. 1995. Transdifferentiation of 
myoblasts by the adipogenic transcription factors PPAR gamma and C/EBP 
alpha. Proc. Natl. Acad. Sci. USA 92:9856-9860. 

21. Huang, E. J., K. H. Nocka, J. Buck, and P. Besmer. 1992. Differential 
expression and processing of two cell associated forms of the kit-ligand: KL-1 
and KL-2. Mol. Biol. Cell 3:349-362. 

22. Ibrahimi, A., F. Bonino, S. Bardon, G. Ailhaud, and C. Dani. 1992. Essential 
role of collagens for terminal differentiation of preadipocytes. Biochem. 
Biophys. Res. Commun. 187:1314-1322. 

23. Jensen, C. H., T. N. Krogh, P. Højrup, P. P. Clausen, K. Skjødt, L. I. 
Larsson, J. J. Enghild, and B. Teisner. 1994. Protein structure of fetal 
antigen 1 (FA1). A novel circulating human epidermal-growth-factor-like 
protein expressed in neuroendocrine tumors and its relation to the gene 
products of dlk and pG2. Eur. J. Biochem. 225:83-92. 
24. Kliewer, S. A., J. M. Lenhard, T. M. Willson, I. Patel, D. C. Morris, and J. M. 
Lehmann. 1995. A prostaglandin J2 metabolite binds peroxisome 
proliferator-activated receptor gamma and promotes adipocyte differentiation. Cell 
83:813-819. 

25. Klyde, B. J., and J. Hirsch. 1979. Increased cellular proliferation in adipose 
tissue of adult rats fed a high-fat diet. J. Lipid Res. 20:705-715. 
26. Klyde, B. J., and J. Hirsch. 1979. Isotopic labeling of DNA in rat adipose 
tissue: evidence for proliferating cells associated with mature adipocytes. J. 
Lipid Res. 20:691-704. 

27. Kohno, T., M. T. Brewer, S. L. Baker, P. E. Schwartz, M. W. King, K. K. 
Hale, C. H. Squires, R. C. Thompson, and J. L. Vannice. 1990. A second 
tumor necrosis factor receptor gene product can shed a naturally occurring 
tumor necrosis factor inhibitor. Proc. Natl. Acad. Sci. USA 87:8331-8335. 



988 SMAS ET AL. 



MOL. CELL. BIOL. 



28. Kopczynski, C. C., A. K. Alton, K. Fechtel, P. J. Kooh, and M. A. T. 
Muskavitch. 1988. Delta, a Drosophila neurogenic gene, is transcriptionally 
complex and encodes a protein related to blood coagulation factors and 
epidermal growth factor of vertebrates. Genes Dev. 2:1723-1735. 

29. Laborda, J., E. A. Sausville, T. Hoffman, and V. Notario. 1993. Dlk, a 
putative mammalian homeotic gene differentially expressed in small cell lung 
carcinoma and neuroendocrine tumor cell line. J. Biol. Chem. 268:3817- 
3830. 

30. Luetteke, N. C., G. K. Michalopoulos, J. Teixido, R. Gilmore, J. Massague, 
and D. C. Lee. 1988. Characterization of high molecular weight transforming 
growth factor alpha produced by rat hepatocellular carcinoma cells. 
Biochemistry 27:6488-6494. 

31. Massague, J. 1990. Transforming growth factor-alpha. A model for 
membrane-anchored growth factors. J. Biol. Chem. 265:21393-21396. 

32. Massague, J., and A. Pandiella. 1993. Membrane-anchored growth factors. 
Annu. Rev. Biochem. 62:515-541. 

33. Mroczkowski, B., M. Reich, K. Chen, G. I. Bell, and S. Cohen. 1989. 
Recombinant human epidermal growth factor precursor is a glycosylated 
membrane protein with biological activity. Mol. Cell. Biol. 9:2771-2778. 

34. Pandiella, A., M. W. Bosenberg, E. J. Huang, P. Besmer, and J. Massague. 
1992. Cleavage of membrane-anchored growth factors involves distinct 
protease activities regulated through common mechanisms. J. Biol. Chem. 267: 
24028-24033. 

35. Parries, G., K. Chen, K. S. Misono, and S. Cohen. 1995. The human urinary 
epidermal growth factor (EGF) precursor. Isolation of a biologically active 
160-kilodalton heparin-binding pro-EGF with a truncated carboxyl terminus. 
J. Biol. Chem. 270:27954-27960. 

36. Pelleymounter, M. A., M. J. Cullen, M. B. Baker, R. Hecht, D. Winters, T. 
Boone, and F. Collins. 1995. Effects of the obese gene product on body 
weight regulation in ob/ob mice. Science 269:540-543. 

37. Rall, L. B., J. Scott, G. I. Bell, R. J. Crawford, J. D. Penschow, H. D. Niall, 
and J. P. Coghlan. 1985. Mouse prepro-epidermal growth factor synthesis by 
the kidney and other tissues. Nature 313:228-231. 

38. Ray, P., F. J. Moy, G. T. Montelione, J. F. Liu, S. A. Narang, H. A. Scheraga, 
and R. Wu. 1988. Structure-function studies of murine epidermal growth 
factor: expression and site-directed mutagenesis of epidermal growth factor 
gene. Biochemistry 27:7289-7295. 

39. Rubin, C. S., A. Hirsch, C. Fung, and O. M. Rosen. 1978. Development of 
hormone receptors and hormonal responsiveness in vitro. Insulin receptors 
and insulin sensitivity in the preadipocyte and adipocyte forms of 3T3-L1 
cells. J. Biol. Chem. 253:7570-7578. 

40. Smas, C. M., D. Green, and H. S. Sul. 1994. Structural characterization and 
alternate splicing of the gene encoding the preadipocyte EGF-like protein 
pref-1. Biochemistry 33:9257-9265. 

41. Smas, C. M., and H. S. Sul. 1993. Pref-1, a protein containing EGF-like 
repeats, inhibits adipocyte differentiation. Cell 73:725-734. 

42. Smas, C. M., and H. S. Sul. 1995. Control of adipocyte differentiation. 
Biochem. J. 309:697-710. 

43. Smas, C. M., L. Chen, and H. S. Sul. Unpublished data. 

44. Smas, C. M., S. Foog, and H. S. Sul. Unpublished data. 

45. Spiegelman, B. M., and C. A. Ginty. 1983. Fibronectin modulation of cell 
shape and lipogenic gene expression in 3T3-adipocytes. Cell 35:657-666. 

46. Teixido, J., and J. Massague. 1988. Structural properties of a soluble 
bioactive precursor for transforming growth factor-alpha. J. Biol. Chem. 263: 
3924-3929. 

47. Tontonoz, P., E. Hu, R. A. Graves, A. I. Budavari, and B. M. Spiegelman. 
1994. mPPAR gamma 2: tissue-specific regulator of an adipocyte enhancer. 
Genes Dev. 8:1224-1234. 

48. Wharton, K. A., K. M. Johansen, T. Xu, and S. Artavanis-Tsakonas. 1985. 
Nucleotide sequence from the neurogenic locus notch implies a gene product 
that shares homology with proteins containing EGF-like repeats. Cell 
43:567-581. 

49. Wong, S. T., L. F. Winchell, B. K. McCune, H. S. Earp, J. Teixido, J. 
Massague, B. Herman, and D. C. Lee. 1989. The TGF-alpha precursor 
expressed on the cell surface binds to the EGF receptor on adjacent cells, 
leading to signal transduction. Cell 56:495-506. 

50. Wu, Z., Y. Xie, N. L. R. Bucher, and S. R. Farmer. 1995. Conditional ectopic 
expression of C/EBP beta in NIH-3T3 cells induces PPAR gamma and 
stimulates adipogenesis. Genes Dev. 9:2350-2363. 

51. Zhang, Y., R. Proenca, M. Maffei, M. Barone, L. Leopold, and J. M. 
Friedman. 1994. Positional cloning of the mouse obese gene and its human 
homologue. Nature 372:425-432. 



